“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”
Sources: GitHub | Google Drive | OneDrive
Assumption: Working directory has sub-folders named "data", "images", "code", "docs".
# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C LC_TIME=English_India.1252
##
## attached base packages:
## [1] compiler grid stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] arulesViz_1.5-1 arules_1.7-1 cluster_2.1.2 factoextra_1.0.7
## [5] rfm_0.2.2 stringi_1.7.5 mlbench_2.1-3 glmnet_4.1-3
## [9] Matrix_1.3-4 caret_6.0-90 lattice_0.20-45 RColorBrewer_1.1-2
## [13] fastDummies_1.6.3 psych_2.1.9 scales_1.1.1 viridisLite_0.4.0
## [17] corrplot_0.92 GGally_2.1.2 microbenchmark_1.4.9 ggpmisc_0.4.4
## [21] ggpp_0.4.2 qcc_2.7 VIM_6.1.1 colorspace_2.0-2
## [25] mice_3.13.0 nortest_1.0-4 Lahman_9.0-0 gapminder_0.3.0
## [29] nycflights13_1.0.2 gifski_1.4.3-1 data.table_1.14.2 zoo_1.8-9
## [33] car_3.0-12 carData_3.0-4 lubridate_1.8.0 e1071_1.7-9
## [37] latex2exp_0.5.0 readxl_1.3.1 kableExtra_1.3.4 forcats_0.5.1
## [41] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.1.0
## [45] tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5 conflicted_1.0.4
##
## loaded via a namespace (and not attached):
## [1] backports_1.3.0 systemfonts_1.0.3 igraph_1.2.8 plyr_1.8.6
## [5] sp_1.4-6 splines_4.1.2 listenv_0.8.0 digest_0.6.28
## [9] foreach_1.5.1 htmltools_0.5.2 viridis_0.6.2 fansi_0.5.0
## [13] magrittr_2.0.1 tzdb_0.2.0 graphlayouts_0.7.2 recipes_0.1.17
## [17] globals_0.14.0 gower_0.2.2 matrixStats_0.61.0 svglite_2.0.0
## [21] rvest_1.0.2 ggrepel_0.9.1 xfun_0.28 crayon_1.4.2
## [25] jsonlite_1.7.2 survival_3.2-13 iterators_1.0.13 glue_1.5.0
## [29] polyclip_1.10-0 gtable_0.3.0 ipred_0.9-12 webshot_0.5.2
## [33] MatrixModels_0.5-0 future.apply_1.8.1 shape_1.4.6 DEoptimR_1.0-9
## [37] abind_1.4-5 SparseM_1.81 DBI_1.1.1 Rcpp_1.0.7
## [41] laeken_0.5.2 tmvnsim_1.0-2 proxy_0.4-26 stats4_4.1.2
## [45] lava_1.6.10 prodlim_2019.11.13 vcd_1.4-9 httr_1.4.2
## [49] ellipsis_0.3.2 farver_2.1.0 pkgconfig_2.0.3 reshape_0.8.8
## [53] nnet_7.3-16 sass_0.4.0 utf8_1.2.2 tidyselect_1.1.1
## [57] rlang_0.4.12 reshape2_1.4.4 munsell_0.5.0 cellranger_1.1.0
## [61] tools_4.1.2 cachem_1.0.6 generics_0.1.1 ranger_0.13.1
## [65] broom_0.7.10 evaluate_0.14 fastmap_1.1.0 yaml_2.2.1
## [69] ModelMetrics_1.2.2.2 knitr_1.36 tidygraph_1.2.0 robustbase_0.93-9
## [73] ggraph_2.0.5 future_1.23.0 nlme_3.1-153 quantreg_5.86
## [77] xml2_1.3.2 rstudioapi_0.13 tweenr_1.0.2 bslib_0.3.1
## [81] vctrs_0.3.8 pillar_1.6.4 lifecycle_1.0.1 lmtest_0.9-39
## [85] jquerylib_0.1.4 conquer_1.2.1 R6_2.5.1 bookdown_0.24
## [89] gridExtra_2.3 parallelly_1.29.0 codetools_0.2-18 boot_1.3-28
## [93] MASS_7.3-54 assertthat_0.2.1 withr_2.4.2 mnormt_2.0.2
## [97] parallel_4.1.2 hms_1.1.1 rpart_4.1-15 timeDate_3043.102
## [101] class_7.3-19 rmarkdown_2.11 ggforce_0.3.3 pROC_1.18.0
# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'
I wanted a single document containing notes, code, and output as a quick reference for the lectures. A combination of multiple file formats (docx, csv, xlsx, R, png, etc.) was not working out for me, so I used the Bookdown package to generate this HTML file.
All of us had to stumble through some of the most common problems individually, and as we approach deeper topics, a more collaborative approach might be more beneficial.
Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics here, and I am interested in hearing about your experiences too.
Towards that goal, I am sharing these notes in the hope that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to make any significant changes to your working environment; please let me know if you observe otherwise.
Currently, my priority is to get in sync with the ongoing lectures. The time constraint has led to the issues listed below, which will be corrected as and when possible.
Last, but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.
Thank You all for the encouragement.
Shivam
R is case-sensitive, i.e. c() not C(), and View() not view().
The hash sign “#” comments out everything after it until the end of the line. There are no multiline comments.
Backslash “\” is reserved to escape the character that follows it.
The Escape key stops the parser, e.g. when the console shows a “+” prompt because R is waiting for more input before evaluating.
Overview
Execute the current expression in the Source Pane (top) with the ‘Run’ button or "Ctrl + Enter".
Execute the current expression in the Console Pane (bottom) with “Enter”.
Windows 10 uses the backslash “\” in paths. R, however, uses the forward slash “/”, because the backslash “\” is the escape character in R.
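A minimal sketch of how these path rules play out in R (the paths shown are illustrative, not real files):

```r
# #file.path() builds paths using "/" (the platform separator R prefers)
file.path("data", "aa.xlsx")   # "data/aa.xlsx"
#
# #The two characters "\\" in source code encode ONE backslash character
nchar("\\")                    # 1
#
# #An escaped Windows-style path prints with single backslashes
cat("D:\\Analytics\\xADSM\n")  # prints: D:\Analytics\xADSM
```

This is why copying a path straight from Windows Explorer into an R script fails: each backslash must either be doubled or replaced with a forward slash.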
In R Studio, Set Working Directory by:
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (Old DOS Convention i.e. ~1 after 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME")
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
setwd("~")
}
If the R program is written over the console, line by line, then the output is printed automatically, i.e. no function is needed for printing. This is called implicit printing.
Inside an R Script File, implicit printing does not work and the expression needs to be printed explicitly.
In R, the most common method to print the output ‘explicitly’ is by the function print().
# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"Everything that exists in R is an object in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; Evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.
# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)
Caution: Always use “<-” for assignment, NOT “=”.
While “=” can be used for assignment, doing so is highly discouraged because it may behave differently under certain subtle conditions that are difficult to debug. The convention is to use “=” only in function calls for argument association (as a syntactic token).
There are 5 assignment operators (<-, =, <<-, ->, ->>), others are not going to be discussed for now.
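One subtle difference between “=” and “<-” inside a function call can be shown directly; this sketch assumes no object named ‘x’ exists in the environment yet:

```r
# #Inside a call, "=" associates an argument; "<-" assigns in the calling scope
median(x = 1:10)                  # 5.5; 'x' here only names the argument
exists("x", inherits = FALSE)     # FALSE: no object 'x' was created
#
median(x <- 1:10)                 # 5.5; but this ALSO creates object 'x'
exists("x", inherits = FALSE)     # TRUE
rm(x)
```

This side effect of “<-” inside calls is exactly the kind of subtle behaviour the convention above is meant to avoid.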
All the created objects are listed in the Environment Tab of the Top Right Pane.
# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10
In the Environment Tab, any object can be selected and deleted using Brush.
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #Faster, verbose, and would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'bb' not found\n"21.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
21.2 Elements are the entities on which data are collected. (Generally ROWS)
21.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
21.4 The set of measurements obtained for a particular element is called an observation.
21.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form Data Structures (vector, list, matrix, dataframe, factor etc.). Refer What is a Vector!
Atomic vectors are homogeneous i.e. each component has the same datatype. A vector type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().
If the output of an expression does not show numbers in brackets like ‘[1]’ then it is a ’NULL’ type return. [Numbers] show that it is a Vector. Ex: str() and cat() outputs are of NULL Type.
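Whether a result would be auto-printed can be inspected with withVisible(); a small sketch of the NULL-type, invisible returns mentioned above:

```r
# #withVisible() reports whether a result would be auto-printed at the console
withVisible("Hello")$visible        # TRUE: the value prints implicitly
withVisible(invisible(5))$visible   # FALSE: the value is returned invisibly
#
# #cat() prints as a side effect but returns NULL (invisibly)
is.null(cat("side effect\n"))       # TRUE
```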
Use function c() to create a vector (or a list) -
NULL < raw < logical < integer < double < complex < character < list < expression.
Caution: The colon “:” might produce a vector of unexpected length (in the case of 0-length vectors). Suggestion: Use the colon only with hardcoded numbers, i.e. “1:10” is ok, while “1:n” is dangerous and should be avoided.
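The coercion hierarchy can be verified with typeof(): when types are mixed inside c(), the “highest” type present wins.

```r
# #Mixed inputs are coerced upward along the hierarchy
typeof(c(TRUE, 1L))        # "integer"
typeof(c(TRUE, 1L, 2.5))   # "double"
typeof(c(1, "a"))          # "character"
```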
Caution: The seq() function might produce a vector of unexpected type (in the case of 1-length vectors). Suggestion: Use seq_along() or seq_len().
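The zero-length danger of “1:n” can be demonstrated in a few lines; this is a minimal sketch using a hypothetical bound n:

```r
n <- 0L
1:n                       # int [1:2] 1 0 -- counts DOWN; length 2, not empty!
seq_len(n)                # integer(0)   -- the safe, empty sequence
seq_along(character(0))   # integer(0)   -- safe index sequence for any vector
#
# #A loop over seq_len(n) therefore runs zero times, as intended
for (i in seq_len(n)) stop("never reached")
```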
# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
## int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
## num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
## chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
## logi [1:5] TRUE FALSE FALSE TRUE TRUE
# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
## int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double" "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
## int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
## num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
## num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
## int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
## int [1:5] 1 2 3 4 5
str(seq(1, 5, 1))
## num [1:5] 1 2 3 4 5
str(letters[1:5])
## chr [1:5] "a" "b" "c" "d" "e"
str(1:5 %% 2 == 0)
## logi [1:5] FALSE TRUE FALSE TRUE FALSE
# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
## income gender
## 1 100 male
## 2 200 female
## 3 300 female
## 4 400 female
## 5 500 male
#
# #Structure
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"R Script file extension is “.R”
"Ctrl+ S" will Open Save Window at Working Directory.
"Ctrl+ O" will Open the Browse Window at Working Directory.
# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R
# #Subdirectory "images" has images like .png
#
# #Check if a File exists
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
}
## File Exists
#
if(exists("XL", envir = .z)) {
cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
#
if(file.exists(path_absolute)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", path_absolute, "\n"))
}
} else {
cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n",
"It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists
write.csv() and read.csv() can be used in combination to export data and import it back into R. But this round trip has some limitations -
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
#
xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object being read is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE
# #Exercise to show how to match the objects being imported/exported from CSV
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE
str(xx_data)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame': 5 obs. of 6 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
## $ age1 : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ x_age : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
Packages include reusable functions, the documentation that describes how to use them, and sample data.
In R Studio: Packages Tab | Install | Package Name = “psych” | Install
if(FALSE){
# #WARNING: This will install packages and R Studio will NOT work for that duration
# #Install Packages and their dependencies
install.packages("psych", dependencies = TRUE)
}
# #Load a Package with or without Quotes
library(readxl)
library("readr")# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
suppressMessages(library(x, character.only = TRUE))}))
# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}
To import Excel in RStudio: Environment | Dropdown | From Excel | Browse
The object imported by read.csv(), i.e. ‘mydata’, is NOT the same as the one imported by read_excel(), i.e. ‘mydata_xl’.
All of these objects can be converted into any other form as needed, i.e. dataframe to tibble or vice versa.
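Such conversions are a one-liner in each direction; a minimal sketch assuming the tibble package is installed (the tiny dataframe here is made up for illustration):

```r
library(tibble)
#
# #A small illustrative dataframe
df <- data.frame(income = c(100, 200), gender = c("male", "female"))
#
# #dataframe -> tibble
tb <- as_tibble(df)
class(tb)    # "tbl_df" "tbl" "data.frame"
#
# #tibble -> dataframe
df2 <- as.data.frame(tb)
class(df2)   # "data.frame"
```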
# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
b09_flights <- readRDS(xx_rds)
} else {
# #Read CSV
b09_flights <- read.csv(xx_csv)
# #Write Object as RDS
saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights
str(mydata)
## 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : int 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : int 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : int 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : int 1400 1416 1089 1576 762 719 1065 229 944 733 ...
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...head(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
## 1 2013 1 1 517 515 2 830 819 11 UA 1545
## 2 2013 1 1 533 529 4 850 830 20 UA 1714
## 3 2013 1 1 542 540 2 923 850 33 AA 1141
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696
## tailnum origin dest air_time distance hour minute time_hour
## 1 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
## 2 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
## 3 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
## 4 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
## 5 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
## 6 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
tail(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013 9 30 NA 1842 NA NA 2019 NA EV
## 336772 2013 9 30 NA 1455 NA NA 1634 NA 9E
## 336773 2013 9 30 NA 2200 NA NA 2312 NA 9E
## 336774 2013 9 30 NA 1210 NA NA 1330 NA MQ
## 336775 2013 9 30 NA 1159 NA NA 1344 NA MQ
## 336776 2013 9 30 NA 840 NA NA 1020 NA MQ
## flight tailnum origin dest air_time distance hour minute time_hour
## 336771 5274 N740EV LGA BNA NA 764 18 42 2013-09-30 18:00:00
## 336772 3393 <NA> JFK DCA NA 213 14 55 2013-09-30 14:00:00
## 336773 3525 <NA> LGA SYR NA 198 22 0 2013-09-30 22:00:00
## 336774 3461 N535MQ LGA BNA NA 764 12 10 2013-09-30 12:00:00
## 336775 3572 N511MQ LGA CLE NA 419 11 59 2013-09-30 11:00:00
## 336776 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00
# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS", ".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
b09_flights_xls <- readRDS(xx_rds_xl)
} else {
b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:336776] 2013 2013 2013 2013 2013 ...
## $ month : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : chr [1:336776] "517" "533" "542" "544" ...
## $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : chr [1:336776] "2" "4" "2" "-1" ...
## $ arr_time : chr [1:336776] "830" "850" "923" "1004" ...
## $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
## $ arr_delay : chr [1:336776] "11" "20" "33" "-18" ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : num [1:336776] 1545 1714 1141 725 461 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : chr [1:336776] "227" "227" "160" "183" ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS", ".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
xxflights <- readRDS(xx_rds)
} else {
xxflights <- read_csv(xx_csv, show_col_types = FALSE)
attr(xxflights, "spec") <- NULL
attr(xxflights, "problems") <- NULL
saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights
# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame': 336776 obs. of 3 variables:
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour: chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))
Caution: Attaching a dataset should be avoided, to prevent unexpected behaviour due to ‘masking’. Using full scope resolution, i.e. ‘data_frame$column_header’, results in fewer bugs. However, if a dataset has been attached, please ensure that it is also detached.
Caution: If a dataset is attached more than once, e.g. 4 times, note that there will be 4 copies attached to the environment (this can be checked with search()). Each copy needs to be detached separately.
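A sketch of cleaning up every attached copy in one go, using a hypothetical dataframe ‘dfz’ (not one of the datasets used elsewhere in these notes):

```r
dfz <- data.frame(zz = 1:3)
#
# #Attach the same dataset twice: two copies land on the search path
attach(dfz)
attach(dfz)
sum(search() == "dfz")   # 2
#
# #Detach repeatedly until no copy remains
while ("dfz" %in% search()) detach("dfz", character.only = TRUE)
sum(search() == "dfz")   # 0
rm(dfz)
```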
if(FALSE){
# #WARNING: Attaching a Dataset is discouraged because of 'masking'
# #'dep_time' is Column Header of a dataframe 'mydata'
tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
# #Attach the Dataset
attach(mydata)
# #Now all the column headers are accessible without the $ sign
str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
# #But, there are other datasets also, attaching another one results in MESSAGE
attach(mydata_xl)
## The following objects are masked from mydata:
##
## air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
## distance, flight, hour, minute, month, origin, sched_arr_time,
## sched_dep_time, tailnum, time_hour, year
str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus now it is showing as 'chr' in place of original 'int'
# #Column Header Names can be highly varied and those will silently mask other variables
# #Hence, attaching a dataset would result in random bugs or unexpected behaviours
#
# #Detach a Dataset
detach(mydata_xl)
detach(mydata)
}
Figure 1.1 Correlation using psych::pairs.panels()
# # Subset 3 Columns and 10,000 rows
x_rows <- 10000L
data_pairs <- mydata[1:x_rows, c(7, 16, 9)]
#
# #Equivalent
ii <- mydata %>%
select(air_time, distance, arr_delay) %>%
slice_head(n = x_rows)
#
stopifnot(identical(ii, data_pairs))
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
print("Please reduce the number of points to a sane number!")
ggplot()
} else {
#B09P01
pairs.panels(data_pairs)
if(FALSE){# Cleaner Graph
pairs.panels(data_pairs, smooth = FALSE, jiggle = FALSE, rug = FALSE, ellipses = FALSE,
cex.cor = 1, cex = 1, gap = 0, main = "Title")
title(sub = "Caption", line = 4, adj = 1)
}
}
These allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. To know more, go to RStudio.
To know more about Google Colab, go to Google Colab.
NOTE: As I am not using Google Colab, the workflow explained between 00:00 to 35:10 is NOT covered here. If someone is using Google Colab, and is willing to share their notes, I would include those.
Base R graphs/plots are shown in Figure 2.1.
Figure 2.1 Flights: Arrival Time (Y) vs. Departure (X) Time
# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000 19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))
# #Split a Dataframe by subsetting
data_1 <- bb[ , 1:8]
data_2 <- bb[ , 9:19]
# str(data_1)
# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(data_3)
stopifnot(identical(data_3, data_4))
# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows + 1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb) - n_rows))))
# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))
# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))
NA can be coerced to any other vector type except raw. There are also typed constants like NA_integer_, NA_real_, etc. For checking only the presence of NA, anyNA() is faster than is.na().
Overview of ‘Not Available’
To remove all NA
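The coercion of NA and the typed NA constants can be checked directly with typeof(); a minimal sketch:

```r
# #NA adapts to the surrounding vector's type
typeof(NA)             # "logical"
typeof(c(NA, 1L))      # "integer": NA becomes NA_integer_
typeof(c(NA, 1.5))     # "double"
typeof(c(NA, "a"))     # "character"
#
# #Typed constants make the intended type explicit
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"
```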
bb <- xxflights
# #anyNA() is faster than is.na()
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
# #
bb %>% summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything()) %>% filter(value > 0)
## # A tibble: 6 x 2
## name value
## <chr> <int>
## 1 dep_time 8255
## 2 dep_delay 8255
## 3 arr_time 8713
## 4 arr_delay 9430
## 5 tailnum 2512
## 6 air_time 9430
#
colSums(is.na(bb)) %>% as_tibble(rownames = "Cols") %>% filter(value > 0)
## # A tibble: 6 x 2
## Cols value
## <chr> <dbl>
## 1 dep_time 8255
## 2 dep_delay 8255
## 3 arr_time 8713
## 4 arr_delay 9430
## 5 tailnum 2512
## 6 air_time 9430
#
# #Vector of Columns having NA
which(bb_na_col != 0)
## dep_time dep_delay arr_time arr_delay tailnum air_time
## 4 6 7 9 12 15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[, c(6, 9, 4)]))
## [1] 9430
# #Remove all rows which have any NA
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346 19
# #Equivalent
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346 19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521 19
Sources: (SO) Grouping Functions and the Apply Family, (SO) Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This
The apply family of functions in R is designed to avoid explicit use of loop constructs.
# #Subset Dataframe
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time)
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
## dep_delay arr_delay dep_time
## 12.555156 6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
Refer The 6 Datatypes of Atomic Vectors
Create a Basic Tibble, Table2.1, for evaluating ‘is.x()’ series of functions in Base R
| ii | dd | cc | ll | ff | fo | dtm | dat |
|---|---|---|---|---|---|---|---|
| 1 | 1 | a | FALSE | odd | odd | 2022-01-31 20:17:13 | 2022-02-01 |
| 2 | 2 | b | TRUE | even | even | 2022-01-31 20:17:14 | 2022-02-02 |
| 3 | 3 | c | FALSE | odd | odd | 2022-01-31 20:17:15 | 2022-02-03 |
| 4 | 4 | d | TRUE | even | even | 2022-01-31 20:17:16 | 2022-02-04 |
| 5 | 5 | e | FALSE | odd | odd | 2022-01-31 20:17:17 | 2022-02-05 |
| 6 | 6 | f | TRUE | even | even | 2022-01-31 20:17:18 | 2022-02-06 |
# #Basic Tibble
nn <- 6L
xxbasic10 <- tibble(ii = 1:nn, dd = seq(1, nn, 1), cc = head(letters, nn),
ll = (ii %% 2) == 0, ff = factor(rep(c("odd", "even"), length.out = nn)),
fo = factor(rep(c("odd", "even"), length.out = nn), ordered = TRUE),
dtm = Sys.time() + 1:nn, dat = Sys.Date() + 1:nn)
bb <- xxbasic10
str(bb)
# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
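The attribute rule above can be demonstrated directly; a small self-contained sketch:

```r
# is.vector() stays TRUE with names, but any other attribute disqualifies it
x <- 1:3
stopifnot(is.vector(x))
names(x) <- c("a", "b", "c")   # 'names' is the one permitted attribute
stopifnot(is.vector(x))
attr(x, "unit") <- "cm"        # any other attribute -> FALSE
stopifnot(!is.vector(x))
```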
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
## ii dd cc ll ff fo dtm dat
## 0 0 0 0 2 2 0 0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #table()
table(bb$ff)
##
## even odd
## 3 3
# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.POSIXt() is TRUE for POSIXct or POSIXlt, but NOT for Date
vapply(bb, is.POSIXt, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#
# #is.Date() is TRUE only for Date
vapply(bb, is.Date, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
Factors can store both strings and integers. They are useful for columns that have a limited number of unique values, such as "Male"/"Female" or "TRUE"/"FALSE", and they are widely used in data analysis for statistical modelling.
A factor is essentially an integer vector of level codes together with a levels attribute that maps each code back to a character label.
as.factor() vs. factor()
str(bb$ll)
## logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
## Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
##
## FALSE TRUE
## 3 3
table(bb$new)
##
## FALSE TRUE
## 3 3
#
# #Levels can be Labelled differently also
str(bb$ff)
## Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# #
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 1 2 1 2 1 2
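One classic caution before the factor-coercion examples that follow: for a factor whose labels look numeric, as.numeric() returns the internal level codes, not the labels. A minimal sketch:

```r
# Convert via as.character() (or index the levels) to recover labels as numbers
f <- factor(c("10", "20", "10", "30"))
as.numeric(f)                          # returns codes 1 2 1 3, NOT 10 20 10 30
good <- as.numeric(as.character(f))
stopifnot(identical(good, c(10, 20, 10, 30)))
stopifnot(identical(good, as.numeric(levels(f))[f]))  # equivalent idiom
```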
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2022-01-31 20:17:13" "2022-01-31 20:17:14" "2022-01-31 20:17:15" ...
## $ dat: Date[1:6], format: "2022-02-01" "2022-02-02" "2022-02-03" ...
## $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
## $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
## $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
# #Unordered Named Vector of Fruits with Names of Colours
# #NOTE: First letters of each colour and fruit match
ii <- c("pink" = "papaya", "black" = "banana", "orchid" = "orange", "amber" = "apple")
ii
## pink black orchid amber
## "papaya" "banana" "orange" "apple"
#
# #Factor Vectors (default is Alphabetical Sorting) using unname() to remove names
fruit_base <- factor(unname(ii))
# #sort()
fruit_sort <- factor(unname(sort(ii)))
# #unique() provides the values in the sequence of their appearance
fruit_uniq <- factor(unname(ii), levels = unique(ii))
#
# #Default Levels Match even though the actual Vectors do not Match
stopifnot(identical(levels(fruit_base), levels(fruit_sort)))
fruit_base
## [1] papaya banana orange apple
## Levels: apple banana orange papaya
fruit_sort
## [1] apple banana orange papaya
## Levels: apple banana orange papaya
fruit_uniq
## [1] papaya banana orange apple
## Levels: papaya banana orange apple
#
# #Relabelling: First letters should always match between Fruits and Colours
color_base <- fruit_base
color_sort <- fruit_sort
color_uniq <- fruit_uniq
#
levels(color_base) <- names(ii)[match(color_base, ii)] #WRONG
levels(color_sort) <- names(ii)[match(color_sort, ii)]
levels(color_uniq) <- names(ii)[match(color_uniq, ii)]
#
# #CAUTION: This is WRONG.
color_base #WRONG
## [1] amber black orchid pink
## Levels: pink black orchid amber
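The #WRONG outcome above happens because match() was computed on the data values while levels(color_base) follow alphabetical order. A corrected sketch, matching on the levels themselves:

```r
# Relabel by matching the LEVELS (not the data values) against the lookup vector
ii <- c("pink" = "papaya", "black" = "banana", "orchid" = "orange", "amber" = "apple")
color_fix <- factor(unname(ii))                      # levels sorted alphabetically
levels(color_fix) <- names(ii)[match(levels(color_fix), ii)]
stopifnot(identical(as.character(color_fix), names(ii)))
```

Matching on levels(x) works for any level order, which is why it is the safe general form of this relabelling.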
#
color_sort
## [1] amber black orchid pink
## Levels: amber black orchid pink
color_uniq
## [1] pink black orchid amber
## Levels: pink black orchid amber
Caution: The only thing you need to take care of is that you do not give two elements the same name. R will NOT throw an ERROR if you do.
Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You cannot do that with lists, as a row is undefined there.
A Dataframe is intended to be used as a relational table: elements in the same column are related in that they are all measures of the same metric, and elements in the same row are related in that they are all measures from the same observation or item. This is why, when you look at the structure of a Dataframe, it states the number of observations and the number of variables instead of the number of rows and columns.
Dataframes are distinct from Matrices because they can hold heterogeneous data types across columns/variables. Dataframes do not permit multiple data types within a single column/variable, for reasons that also follow from the relational-table idea.
All this implies that you should use a data frame for any dataset that fits in that two-dimensional structure. Essentially, you use data frames for any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.
# #CAUTION: Do not Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# #The 3rd element cannot be accessed using '$'; it returns only the first name match
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3
# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
bb_dft <- data.frame(a = c(1, 2), b = c('a', 'b', 'c')),
error = function(e) {
# #Print ERROR
cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
}
)
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same type()
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
## $ a: num [1:2] 1 2
## $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame': 2 obs. of 2 variables:
## $ a: num 1 2
## $ b: chr "a" "b"
#
# #Although 'bb_lst_c' is a list, coercion happens inside each atomic element, i.e. '9' becomes character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
## chr "9"
#
# #Here, '9' stays numeric because it is stored as its own list element; note the extra [[ ]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
## num 9
# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
str(bb_mat)
## int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"
# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2022-01-31 20:17:13" "2022-01-31 20:17:14" "2022-01-31 20:17:15" ...
## $ dat: Date[1:6], format: "2022-02-01" "2022-02-02" "2022-02-03" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
## cc ii dd ll ff fo dtm dat
## 1 a 1 1 FALSE odd odd 2022-01-31 20:17:13 2022-02-01
## 2 b 2 2 TRUE even even 2022-01-31 20:17:14 2022-02-02
## 3 c 3 3 FALSE odd odd 2022-01-31 20:17:15 2022-02-03
## 4 d 4 4 TRUE even even 2022-01-31 20:17:16 2022-02-04
## 5 e 5 5 FALSE odd odd 2022-01-31 20:17:17 2022-02-05
## 6 f 6 6 TRUE even even 2022-01-31 20:17:18 2022-02-06
bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 517 2 N14228 UA
## 2 533 4 N24211 UA
## 3 542 2 N619AA AA
## 4 544 -1 N804JB B6
## 5 554 -6 N668DN DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 2040 -43 N592JB B6
## 2 2022 -33 N612DL DL
## 3 1408 -32 N825AS EV
## 4 1900 -30 N934DL DL
## 5 1703 -27 N208FR F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 641 1301 N384HA HA
## 2 1432 1137 N504MQ MQ
## 3 1121 1126 N517MQ MQ
## 4 1139 1014 N338AA AA
## 5 845 1005 N665MQ MQ
bb <- xxbasic10
bb
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2022-01-31 20:17:13 2022-02-01
## 2 2 2 b TRUE even even 2022-01-31 20:17:14 2022-02-02
## 3 3 3 c FALSE odd odd 2022-01-31 20:17:15 2022-02-03
## 4 4 4 d TRUE even even 2022-01-31 20:17:16 2022-02-04
## 5 5 5 e FALSE odd odd 2022-01-31 20:17:17 2022-02-05
## 6 6 6 f TRUE even even 2022-01-31 20:17:18 2022-02-06
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2022-01-31 20:17:13 2022-02-01
## 2 3 3 c FALSE odd odd 2022-01-31 20:17:15 2022-02-03
## 3 5 5 e FALSE odd odd 2022-01-31 20:17:17 2022-02-05
## 4 2 2 b TRUE even even 2022-01-31 20:17:14 2022-02-02
## 5 4 4 d TRUE even even 2022-01-31 20:17:16 2022-02-04
## 6 6 6 f TRUE even even 2022-01-31 20:17:18 2022-02-06
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 5 5 e FALSE odd odd 2022-01-31 20:17:17 2022-02-05
## 2 3 3 c FALSE odd odd 2022-01-31 20:17:15 2022-02-03
## 3 1 1 a FALSE odd odd 2022-01-31 20:17:13 2022-02-01
## 4 6 6 f TRUE even even 2022-01-31 20:17:18 2022-02-06
## 5 4 4 d TRUE even even 2022-01-31 20:17:16 2022-02-04
## 6 2 2 b TRUE even even 2022-01-31 20:17:14 2022-02-02
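A caveat on the unary-minus trick used above: it works only for numeric columns. For mixed ascending/descending sorts involving character columns, xtfrm() or order(method = "radix") can be used; a hedged sketch on a toy data frame (not the flights data):

```r
# '-x' fails on character columns; xtfrm() gives sortable numeric ranks,
# and order(method = "radix") accepts a per-key 'decreasing' vector
df <- data.frame(grp = c("a", "b", "a", "b"), val = c("x", "w", "z", "y"))
o1 <- order(df$grp, -xtfrm(df$val))
o2 <- order(df$grp, df$val, method = "radix", decreasing = c(FALSE, TRUE))
stopifnot(identical(o1, o2))
```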
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))
# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #Help files on any Topic including functions from 'installed' but 'not loaded' Packages
?dplyr::mutate()
# #Operators need backticks, i.e. ` . On most keyboards it is located below the 'Esc' key
?`:`
# #To Get the list of All Options used by Base R (including user defined)
?options
Overview
A logical vector can take three possible values: TRUE, FALSE, or NA.
# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE
## [1] FALSE
#
# #For length-1 vectors, the output of the vectorised and non-vectorised forms is the same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of >1 length, output is different
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE TRUE TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of each operand, so caution is advised (since R 4.3.0, length > 1 operands are an error)
TRUE & c(TRUE, FALSE)
## [1] TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE
if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# #No short-circuit for '|' or '&': the right operand is always evaluated, so an Error is thrown
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Does not evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #evaluates Right input because outcome cannot be determined and throws error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE
\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)
# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842 19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
bb_na <- na.omit(bb_1)
print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831 19
dim(bb)
## [1] 336776 19
#
# #Flights in either of the months of November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403 19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120'
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060 19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060 19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060 19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313 19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313 19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311 19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311 19
#
# #Did not leave late (departed early or on time) but arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
#
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970 19
# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120),
select = c("flight", "arr_delay")))
## [1] 319760 2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL",
select = c("flight", "arr_delay")))
## [1] 321139 2
# #Data: mtcars, 32x11, "mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb"
bb <- aa <- mtcars
#str(bb)
#summary(bb)
# #
# #Avoid subset()
ii <- subset(bb, wt > 2 & wt <= 3)
# #which()
jj <- bb[which(bb$wt > 2 & bb$wt <= 3), ]
#
# #which() selects only the TRUE positions, dropping NA
(1:2)[which(c(TRUE, NA))]
## [1] 1
(1:2)[c(TRUE, NA)]
## [1] 1 NA
#
# #which() is faster than head()
ee <- bb[which(bb$wt > 2 & bb$wt <= 3)[1:6], ]
ff <- head(bb[bb$wt > 2 & bb$wt <= 3, ], 6)
stopifnot(identical(ee, ff))
#
# #Normal Filter using [] operator
kk <- bb[bb$wt > 2 & bb$wt <= 3, ]
#
# #with()
ll <- with(bb, bb[wt > 2 & wt <= 3, ])
#
# #filter()
mm <- bb %>% filter(wt > 2 & wt <= 3)
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
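The identical() checks above succeed partly because mtcars has no missing values. With NAs present, '[' with a logical condition keeps NA rows (filled with NA), while which() and filter() drop them; a small sketch:

```r
# NA in the condition: '[' keeps an all-NA row, which() drops it
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
stopifnot(nrow(df[df$x > 1, ]) == 2L)        # NA row leaks through
stopifnot(nrow(df[which(df$x > 1), ]) == 1L) # only x == 3 survives
```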
#
# #Another set of equivalent operations for OR
ii <- subset(bb, cyl == 4 | cyl == 6)
jj <- bb[bb$cyl %in% c(4, 6), ]
kk <- bb[which(bb$cyl %in% c(4, 6)), ]
ll <- bb %>% filter(cyl == 4 | cyl == 6)
mm <- bb %>% filter(cyl %in% c(4, 6))
#
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
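A related difference, invisible here because cyl has no NA: '==' propagates NA, while '%in%' returns FALSE for NA, so '%in%' conditions are safe to use directly inside '['. A minimal sketch:

```r
x <- c(4, NA, 6)
x == 4                # TRUE NA FALSE: the NA survives into subsetting
x %in% c(4, 6)        # TRUE FALSE TRUE: NA is simply "not found"
stopifnot(identical(x %in% c(4, 6), c(TRUE, FALSE, TRUE)))
```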
#
# #General Conditional Subsetting on Flights data
bb <- xxflights
dim(bb)
## [1] 336776 19
#
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194 19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198 19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842 19
# #CAUTION: The comma does NOT mean AND; the 2nd argument of which() is 'arr.ind',
# #so only the first condition is applied here (result matches day == 1 alone)
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036 19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482 19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592 19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554 19
# #CAUTION: Again only the first condition applies ('arr.ind' swallows the 2nd)
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036 19
\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)
dplyr::select()
dim(bb)
## [1] 336776 19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))
bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay_mean
## <dbl>
## 1 12.6
#
# #base::summary()
summary(bb$dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd,
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups = "keep")
# #Equivalent
bb %>%
group_by(year, month, day) %>%
summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 365 x 5
## # Groups: year, month, day [365]
## year month day dep_delay_mean arr_delay_mean
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # ... with 355 more rows
# #Get delay grouped by distance ('Distance between airports, in miles')
summary(bb$distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17 502 872 1040 1389 4983
#
# #How many unique values are present in this numeric data (treated as factor levels)
str(as.factor(bb$distance))
## Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
## num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>%
group_by(distance) %>%
summarize(count = n(),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 214 x 4
## # Groups: distance [214]
## distance count dep_delay_mean arr_delay_mean
## <dbl> <int> <dbl> <dbl>
## 1 17 1 NaN NaN
## 2 80 49 18.9 16.5
## 3 94 976 17.5 12.7
## 4 96 607 3.19 5.78
## 5 116 443 17.7 7.05
## 6 143 439 23.6 14.4
## 7 160 376 21.8 16.2
## 8 169 545 18.5 15.1
## 9 173 221 7.05 -0.286
## 10 184 5504 3.07 0.123
## # ... with 204 more rows
#
# #For distance =17, there is only 1 flight and that too has NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, flights to a given destination from the same origin cover the same distance (ABQ: 1826)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
## distance
## <dbl>
## 1 1826
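The uniqueness check above covers a single destination. A hedged sketch to screen every destination at once with dplyr::n_distinct() (assuming the nycflights13 data that xxflights is built from; destinations served from several NYC origins can legitimately show more than one distance):

```r
library(dplyr)
library(nycflights13)
# Destinations with more than one recorded distance (usually multiple origins)
flights %>%
  group_by(dest) %>%
  summarize(n_dist = n_distinct(distance), .groups = "drop") %>%
  filter(n_dist > 1)
```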
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>%
group_by(dest) %>%
filter(distance > 1000) %>%
summarize(count = n(),
distance_mean = mean(distance, na.rm = TRUE),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
## dest count distance_mean dep_delay_mean arr_delay_mean
## <chr> <int> <dbl> <dbl> <dbl>
## 1 ABQ 254 1826 13.7 4.38
## 2 ANC 8 3370 12.9 -2.5
## 3 AUS 2439 1514. 13.0 6.02
## 4 BQN 896 1579. 12.4 8.25
## 5 BUR 371 2465 13.5 8.18
## 6 BZN 36 1882 11.5 7.6
## 7 DEN 7266 1615. 15.2 8.61
## 8 DFW 8738 1383. 8.68 0.322
## 9 DSM 569 1021. 26.2 19.0
## 10 EGE 213 1736. 15.5 6.30
## # ... with 38 more rows
dim(bb)
## [1] 336776 19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
# #Equivalent
bb %>%
select(year:day, arr_delay, dep_delay, distance, air_time) %>%
mutate(gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
## year month day arr_delay dep_delay distance air_time gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11 2 1400 227 9 370. 3.78 2.38
## 2 2013 1 1 20 4 1416 227 16 374. 3.78 4.23
## 3 2013 1 1 33 2 1089 160 31 408. 2.67 11.6
## 4 2013 1 1 -18 -1 1576 183 -17 517. 3.05 -5.57
## 5 2013 1 1 -25 -6 762 116 -19 394. 1.93 -9.83
## 6 2013 1 1 12 -4 719 150 16 288. 2.5 6.4
## 7 2013 1 1 19 -5 1065 158 24 404. 2.63 9.11
## 8 2013 1 1 -14 -3 229 53 -11 259. 0.883 -12.5
## 9 2013 1 1 -8 -3 944 140 -5 405. 2.33 -2.14
## 10 2013 1 1 8 -2 733 138 10 319. 2.3 4.35
## # ... with 336,766 more rows
21.20 A population is the set of all elements of interest in a particular study.
21.23 The process of conducting a survey to collect data for the entire population is called a census.
21.21 A sample is a subset of the population.
27.7 A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
21.25 Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.
Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference
Refer Hypothesis Testing
29.1 Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.
29.2 Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
29.3 Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.
Refer Steps of Hypothesis Testing
29.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
29.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
Question: Is there an ideal sample size
27.9 To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
27.10 A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), and \(\rho_{xy}\) (correlation).
27.11 The numerical value obtained for the sample statistic is called the point estimate. 'Estimate' is used for the sample value only; the corresponding population value is the parameter. An estimate is a value, while an estimator is a function.
Example: \({\overline{x}}\) is an estimator (of the population parameter 'mean,' \({\mu}\)). Its estimate might be 3, and this calculation process is an estimation.
23.8 Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
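Written out in the same notation as the standard-deviation and variance formulas below:

\[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i\]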
Refer Standard Deviation and equation (23.12)
23.15 The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2} \end{align}\]
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
Refer Variance and equation (23.11)
23.14 The variance \(({\sigma}^2)\) is based on the difference between each observation \({x_i}\) and the mean (\({\mu}\) for a population, \({\overline{x}}\) for a sample). The average of the squared deviations is called the variance.
\[\begin{align} \sigma^2 &= \frac{1}{N} \sum _{i=1}^{N} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]
Variability is most commonly measured with the Range, IQR, SD, and Variance.
The sample we draw from the population is only one from a large number of potential samples.
Refer Standard Error
27.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
27.14 A sampling error is the difference between a population parameter and a sample statistic.
Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.
27.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (27.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)
Refer Test Statistic
29.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
29.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (29.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
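Equation (29.1) can be evaluated directly in R; a minimal sketch with made-up numbers (the sample values are hypothetical, not from the text):

```r
# z test statistic when the population SD (sigma) is known
xbar  <- 162.4   # sample mean (hypothetical)
mu0   <- 160     # hypothesized population mean
sigma <- 8       # known population SD (hypothetical)
n     <- 25      # sample size
z <- (xbar - mu0) / (sigma / sqrt(n))    # (2.4) / (1.6) = 1.5
p_two_sided <- 2 * pnorm(-abs(z))        # two-tailed p-value
stopifnot(all.equal(z, 1.5))
```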
Standard Error (SE) is the same as 'the standard deviation of the sampling distribution.' The 'variance of the sampling distribution' is the variance of the data divided by the sample size \(n\).
# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.257
#
# #Standard Error (SE)
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) = 3.693
# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Calculate the deviation from the mean by subtracting each value from the mean
print(hh_dev <- hh - hh_mean)
## [1] 8.1 -1.4 -2.4 7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1] 65.61 1.96 5.76 57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) – 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Variance of the sampling distribution
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution = 13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) =  3.693
Using Dataset Flights: “air_time” (amount of time spent in the air, in minutes). Refer figure 4.1
Figure 4.1 Flights: Air Time (min) excluding NA (Histogram and Density)
# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
## num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4]
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498
# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(data = tibble(x = mean_hh, y = -Inf,
label = paste0("Mean= ", mean_hh)),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(data = tibble(x = median_hh, y = -Inf,
label = paste0("Median= ", median_hh)),
aes(x = x, y = y, label = label),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh,
")"),
caption = "B12P01", title = "Flights: Air Time")
}
# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
#scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Density",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = "B12P02", title = "Flights: Air Time")
}
Using Dataset Flights: “air_time” (amount of time spent in the air, in minutes).
Caution: The trend here does not match the theory; the exercise only demonstrates the ‘how to do it’ part. It can be repeated with better data, a larger sample size, or repeated sampling.
Figure 4.2 Effect of Increasing Sample Size
Figure 4.3 Effect of Increasing Sample Size
bb <- na.omit(xxflights$air_time)
# #Fix Seed
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
cap_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1),
"; SD= ", round(sd(.[[1]]), 1),
#"; Var= ", round(var(.[[1]]), 1),
"; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
")"),
caption = cap_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)
Figure 4.4 Effect of Increasing Sample Size
Figure 4.5 Effect of Increasing Sampling
bb <- na.omit(xxflights$air_time)
# #Fix Seed
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
cap_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, cap_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'),
subtitle = TeX(sprintf(
"($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f; $\\sigma_{\\bar{x}} \\, =$%.0f",
mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
caption = cap_hh,
title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(cap_hh, B12)
rm(B12)
Figure 4.6 Normal Distribution
Refer Normal Distribution and equation (26.2)
26.3 A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
Their importance is partly due to the Central Limit Theorem. The assumption of a normal distribution allows us to apply parametric methods.
38.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
27.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.
Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption is not necessary to meet when you have a large enough sample. A sample size of 30 or more is generally considered large.
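A minimal simulation sketch of the theorem (an assumed example, not from the Flights data): sample means of n = 30 draws from a skewed Exponential(1) population already behave like a normal variable centred on the population mean.

```r
# #2000 sample means, each from n = 30 draws of a skewed population
set.seed(1)
xbar <- replicate(2000, mean(rexp(n = 30, rate = 1)))
# #Close to the population mean 1 and to SE = 1 / sqrt(30) ~ 0.183
round(c(mean = mean(xbar), se = sd(xbar)), 3)
```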
This is the basis of Empirical Rule.
23.23 The empirical rule is used to compute the percentage of data values that fall within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are approximately 68.27%, 95.45%, and 99.73%.
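These percentages can be recovered directly from the standard normal CDF via pnorm():

```r
# #P(mu - k*sigma < X < mu + k*sigma) for k = 1, 2, 3 under normality
within_k <- function(k) pnorm(k) - pnorm(-k)
round(100 * sapply(1:3, within_k), 2)
## [1] 68.27 95.45 99.73
```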
Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.
Refer Standard Normal and equation (26.3)
26.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)
The simplest case of a normal distribution is known as the standard normal distribution. Given the Population with normal distribution \({\mathcal{N}}_{(\mu, \, \sigma)}\)
If \(\overline{X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline{X} - \mu}{\sigma/{\sqrt{n}}}\)
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
For an upper-tail test, the area under the curve to the right of the z-score is the p-value: the likelihood of an observation at least this extreme occurring if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
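A worked sketch with hypothetical numbers (population μ = 100, σ = 15, a sample of n = 36 with sample mean 104):

```r
# #All numbers assumed for illustration
z <- (104 - 100) / (15 / sqrt(36))   # #z-score
p <- pnorm(z, lower.tail = FALSE)    # #upper-tail p-value
round(c(z = z, p = p), 4)
##      z      p
## 1.6000 0.0548
```

Since p > 0.05, such a sample would not justify rejecting the null hypothesis at the 5% level.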
Refer Outliers: C03
23.24 Sometimes unusually large or unusually small values are called outliers. An outlier is a data point that differs significantly from the other observations.
Figure 4.7 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
Example
Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.
Refer Type I and Type II Errors
29.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
29.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
29.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
28.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
29.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
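Power can be computed for an upper-tail z-test. A minimal sketch with assumed numbers (\(H_0: \mu \leq 50\) vs \(H_a: \mu > 50\), true \(\mu = 52\), \(\sigma = 5\), \(n = 25\), \(\alpha = 0.05\)):

```r
alpha <- 0.05
crit  <- qnorm(alpha, lower.tail = FALSE)   # #critical z (about 1.645)
shift <- (52 - 50) / (5 / sqrt(25))         # #true-mean shift in SE units (= 2)
power <- pnorm(crit - shift, lower.tail = FALSE)
round(power, 3)                             # #power = 1 - beta
## [1] 0.639
```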
There is always a tradeoff between Type-I and Type-II errors.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
29.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”
Figure 4.8 Left Tail vs. Right Tail
Figure 4.9 Two Tail
29.18 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
29.16 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
29.19 An acceptance region (confidence interval) is a set of values of the test statistic for which the null hypothesis is accepted, i.e., if the observed test statistic falls in the confidence interval, we accept the null hypothesis and reject the alternative hypothesis.
29.21 A rejection region (critical region) is a set of values of the test statistic for which the null hypothesis is rejected, i.e., if the observed test statistic falls in the critical region, we reject the null hypothesis and accept the alternative hypothesis.
29.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
One-tailed tests are concerned with one side of a statistic, whereas two-tailed tests deal with both tails of the distribution.
A two-tailed test is used when the direction of the effect is unknown, so both sides are tested.
29.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
29.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
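Given an observed test statistic (a hypothetical z = 1.8, not from the text), the p-value for each of the three forms can be read off pnorm():

```r
z <- 1.8                                         # #hypothetical observed z
round(pnorm(z, lower.tail = TRUE), 4)            # #left (lower) tail
## [1] 0.9641
round(pnorm(z, lower.tail = FALSE), 4)           # #right (upper) tail
## [1] 0.0359
round(2 * pnorm(abs(z), lower.tail = FALSE), 4)  # #two tail
## [1] 0.0719
```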
29.15 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Steps for the p-value approach or test statistic approach
29.17 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Steps for the critical value approach
If the population standard deviation \({\sigma}\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test as the sample size increases.
Question: Does the probability from the t-table differ from the probability from the z-table?
It is assumed that \((\overline{x} - \mu)\) follows normality. However, the estimated Standard Error (SE) does not: the (scaled) sample variance follows a chi-squared distribution. Thus \((\overline{x} - \mu)/SE\) becomes a ‘Normal over ChiSq’ ratio, and this ratio follows the t-distribution. Hence the test we apply is called a t-test.
# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (t) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.3722 1.8125 2.2281 2.7638 3.1693
28.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
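As the degrees of freedom grow, the t critical values approach the z critical value, which is the convergence noted above (5% upper tail shown):

```r
# #Upper-tail critical value at alpha = 0.05 for growing degrees of freedom
round(qt(p = 0.05, df = c(5, 30, 100, 1000), lower.tail = FALSE), 4)
## [1] 2.0150 1.6973 1.6602 1.6464
#
# #Limiting z value
round(qnorm(p = 0.05, lower.tail = FALSE), 4)
## [1] 1.6449
```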
Why are the degrees of freedom \((n - 1)\)?
Question: Is there any minimum sample size we must consider before calculating degrees of freedom?
Guess: Degrees of freedom are also used to remove possible bias.
Figure 5.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
Refer Type I and Type II Errors (B12) & Type I and Type II Errors
29.7 The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
29.8 The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
29.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
28.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
29.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
29.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
29.23 p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret
27.13 In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
Population Size = 100, \({\alpha} = 0.05\)
Hypothesis: \(\text{\{Right Tail or Upper Tail\} } {H_0} : {\mu} \leq 22 \iff {H_a}: {\mu} > 22\)
Sample: n=4, dof = 3, \({\overline{x}} = 23\)
Sample: n=10, dof = 9, \({\overline{x}} = 23\)
We know that if we take another sample, we will get a different sample mean. So we need to confirm whether the calculated sample mean \({\overline{x}} = 23\) represents the population mean \({\mu}\), i.e., can we reject or fail to reject \({H_0}\) based on this sample?
3 Approaches for Hypothesis Testing -
If the population standard deviation \({\sigma}\) is known, apply the z-test. If it is unknown, apply the t-test. The t-test converges to the z-test as the sample size increases.
# #Get P(z)
z01 <- round(pnorm(3.44), digits = 6)
z02 <- 1 - round(pnorm(3.44), digits = 6)
z03 <- round(pnorm(3.44, lower.tail = FALSE), digits = 6)
z04 <- format(pnorm(4.55, lower.tail = FALSE), digits = 3, scientific = FALSE)
z05 <- format(pnorm(1.22, lower.tail = FALSE), digits = 5)
z06 <- format(pnorm(1.99, lower.tail = FALSE), digits = 5)
z07 <- format(pnorm(1.99, lower.tail = TRUE), digits = 5)
30.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
Example:
30.6 Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
30.7 Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.
Test Statistic for Independent Sample t-Test Statistic is given by (30.9) as shown below
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}}\]
The t-test is any statistical hypothesis test in which the test statistic follows a Student t-distribution under the null hypothesis.
A t-test is the most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.
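Plugging assumed summary statistics into (30.9) (all numbers hypothetical, for illustration only):

```r
# #Two independent samples (hypothetical summaries)
x1 <- 23; s1 <- 4; n1 <- 40
x2 <- 20; s2 <- 5; n2 <- 50
D0 <- 0                                # #hypothesized difference
t_stat <- ((x1 - x2) - D0) / sqrt(s1^2 / n1 + s2^2 / n2)
round(t_stat, 3)
## [1] 3.162
```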
Example: Suppose we want to evaluate the effect of a training program.
We can take two samples of 50 people each. First Set “Untrained” would be from the set of people who did not receive training. Second Set “Trained” would be from the set of people who have undergone the training. Comparison of these two sample mean performances would be done by “independent sample” t-test.
Or
We can take a sample of 50 “Untrained” people and get their mean performance. Provide the training to these 50 people, then measure their mean performance again. Now we have “paired” samples of the performances of the same people: one set before the training and another after. Comparison of these two sample means is done by a “paired sample” t-test.
Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a “repeated measures” t-test).
The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.
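Both designs can be sketched on simulated scores (all data assumed; set.seed() makes the run reproducible):

```r
# #Simulated before/after performance for the same 50 people
set.seed(7)
before <- rnorm(50, mean = 60, sd = 10)
after  <- before + rnorm(50, mean = 3, sd = 5)   # #training adds ~3 points
# #Paired (matched sample) t-test on the same people
t.test(after, before, paired = TRUE, alternative = "greater")
# #Independent sample form, ignoring the pairing
t.test(after, before, alternative = "greater")
```

The paired test works on within-person differences, so person-to-person variation cancels out; with data like this it typically yields a much smaller p-value than the independent form, which is exactly the precision gain of the matched design.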
Assume there are 3 samples A, B, C. We can run \(C_2^3 = 3\) pairwise tests, i.e. \(\{(A, B), (B, C), (C, A)\}\). However, assuming \({\alpha} = 0.05 \iff {\gamma} = 0.95\) for each test, the confidence over 3 consecutive tests becomes \({\gamma}^3 = 0.857 \iff {\alpha} = 0.143\), which is unacceptably high. To avoid this, we use ANOVA as a single test.
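The inflation is easy to verify:

```r
gamma <- 0.95            # #per-test confidence
round(gamma^3, 3)        # #joint confidence over 3 independent tests
## [1] 0.857
round(1 - gamma^3, 3)    # #effective alpha
## [1] 0.143
```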
A high value of the F statistic indicates that the population means differ.
29.14 If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (29.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
29.24 If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (29.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
Please import the WSES data into the xxWSES object. Due to copyright, it has not been shared.
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Caution: While importing data, Mac users may find it easier to import the CSV file. However, I am not a Mac user, so I cannot confirm this.
# #Import the Data and assign to a temporary variable for ease of use
xxWSES <- f_getRDS(xxWSES)
bb <- xxWSES
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Opportunity No. : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
## $ Reporting Status : chr [1:1000] "Lost" "Won" "Lost" "Won" ...
## $ Sales Outcome : num [1:1000] 0 1 0 1 1 0 0 0 0 0 ...
## $ Product : chr [1:1000] "LearnSys" "GTMSys" "GTMSys" "GTMSys" ...
## $ Industry : chr [1:1000] "Banks" "Airline" "Capital Markets" "Insurance" ...
## $ Region : chr [1:1000] "Africa" "UK" "UK" "UK" ...
## $ Relative Strength in the segment: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
## $ Profit of Customer in Million : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
## $ Sales Value in Million : num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
## $ Profit % : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
## $ WSES Proportion in Joint Bid : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
## $ Leads Conversion Class : chr [1:1000] "F" "E" "F" "E" ...
# #List Column Headers
names(bb)
## [1] "Opportunity No." "Reporting Status"
## [3] "Sales Outcome" "Product"
## [5] "Industry" "Region"
## [7] "Relative Strength in the segment" "Profit of Customer in Million"
## [9] "Sales Value in Million" "Profit %"
## [11] "WSES Proportion in Joint Bid" "Leads Conversion Class"
#
# #Rename Headers
bb_headers <- c("SN" , "RS" , "SO" , "PDT" , "INT" , "RG" , "RS1" , "PM" , "SVM" , "PP" , "JB" , "LCC")
names(bb) <- bb_headers
#
# #Verification
names(bb)
## [1] "SN" "RS" "SO" "PDT" "INT" "RG" "RS1" "PM" "SVM" "PP" "JB" "LCC"
From the case study, it can be seen that multiple columns are categorical (factor) or ordinal (ordered factor).
Question: What is the advantage of this kind of ordered factor over a simple factor?
Question: If ‘RS’ is already integer 0 & 1, why convert it to a factor?
Question: Why is LCC NOT ordinal? (Aside)
# #"Reporting Status i.e. RS" Converting "character" to "factor" and Label them
bb$RS <- factor(bb$RS, levels = c("Lost", "Won"), labels = c("0", "1"))
#
# #"Sales Outcome i.e. SO" Converting "numeric" to "factor"
bb$SO <- factor(bb$SO)
#
# #"Product Vertical i.e. PDT" Ordinal
# #What are the unique values in this column
unique(bb$PDT)
## [1] "LearnSys" "GTMSys" "Lifesys" "Finsys" "Procsys" "Logissys" "ContactSys"
#
# #Converting "character" to "Ordered factor"
# #Note: If level order is not provided, alphabetical ordering is assigned by default.
levels(factor(bb$PDT, ordered = TRUE))
## [1] "ContactSys" "Finsys" "GTMSys" "LearnSys" "Lifesys" "Logissys" "Procsys"
#
# #Provide ordering of factor levels in Ascending Order.
bb$PDT <- factor(bb$PDT, ordered = TRUE,
levels = c("GTMSys", "Procsys", "LearnSys", "Finsys", "Lifesys", "Logissys", "ContactSys"))
#
# #"Industry i.e. INT" Ordinal
bb$INT <- factor(bb$INT, ordered = TRUE,
levels = c("Capital Markets", "Banks", "Defense", "Consumer goods", "Others", "Security",
"Energy", "Insurance", "Airline", "Finance", "Infrastructure", "Mobility", "Other Govt.",
"Govt.", "Telecom equipments", "Health", "Clinical research", "Agriculture"))
#
# #"Region i.e. RG" Ordinal
bb$RG <- factor(bb$RG, ordered = TRUE, levels = c("UK", "Other Europe", "Americas", "Africa",
"India", "Japan", "Singapore", "Spain", "Canada"))
#
# #"Leads Conversion Class i.e. LCC" Ordinal; however, we go with nominal here.
bb$LCC <- factor(bb$LCC, levels = c("E", "V", "F", "L"), labels = c(1, 2, 3, 4))
str(bb)
## spec_tbl_df [1,000 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SN : num [1:1000] 1 2 3 4 5 6 7 8 9 10 ...
## $ RS : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
## $ SO : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
## $ PDT: Ord.factor w/ 7 levels "GTMSys"<"Procsys"<..: 3 1 1 1 1 5 4 4 1 4 ...
## $ INT: Ord.factor w/ 18 levels "Capital Markets"<..: 2 9 1 8 5 5 4 4 16 4 ...
## $ RG : Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
## $ RS1: num [1:1000] 45 56 48 58 49 64 54 54 67 54 ...
## $ PM : num [1:1000] 2.11 0.79 1.62 0.09 1.46 0.94 1.32 1.09 1.3 1.32 ...
## $ SVM: num [1:1000] 10.29 11.42 5.63 10.17 10.6 ...
## $ PP : num [1:1000] 29 46 70 46 32 65 50 74 46 57 ...
## $ JB : num [1:1000] 66 50 50 56 54 52 68 48 56 63 ...
## $ LCC: Factor w/ 4 levels "1","2","3","4": 3 1 3 1 2 3 2 2 4 4 ...
# #If there are "character" columns which should be "numeric"
bb$RS1 <- as.numeric(bb$RS1)
bb$PP <- as.numeric(bb$PP)
bb$JB <- as.numeric(bb$JB)
# #Equivalent
bb <- bb %>% mutate(across(c(RS1, PP, JB), as.numeric))
Assume average sales of 8 million dollars and a population standard deviation of 2 million dollars. Check whether the average sales value in the population is at least 8 million dollars.
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
mean(bb$SVM)
{8.0442 - 8} / {2 / sqrt(1000)}
{mean(bb$SVM) - 8} / {2 / sqrt(1000)}
pnorm(q = 0.6988, lower.tail = FALSE)
1 - pnorm(q = 0.6988, lower.tail = TRUE)
1 - pnorm(q = 0.6988)
pnorm(q = -0.6988)
Note: pnorm() provides the probability to the left of the z-value.
If the population standard deviation is unknown, evaluate the same hypothesis again.
{mean(bb$SVM) - 8} / {sd(bb$SVM) / sqrt(1000)}
pt(q = 0.7049, df = nrow(bb) - 1, lower.tail = FALSE)
t.test(bb$SVM, mu = 8, alternative = "greater")
Check whether the proportion of leads won by WSES is more than 50%.
29.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
sqrt(0.50 * {1 - 0.50} / 1000)
{0.481 - 0.50} / {sqrt(0.50 * {1 - 0.50} / 1000)}
pnorm(q = -1.2016, lower.tail = FALSE)
1 - pnorm(q = -1.2016)
# #Proportions
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
## SO PCT
## <fct> <dbl>
## 1 0 0.519
## 2 1 0.481
#
pnorm(q = -1.2016, lower.tail = FALSE)
## [1] 0.8852407
# #Grouped Percentages
# #table() gives a count
table(bb$SO)
##
## 0 1
## 519 481
#
# #prop.table() works on a table of counts, so pair it with table()
prop.table(table(bb$SO))
##
## 0 1
## 0.519 0.481
#
# #Similar
bb %>% group_by(SO) %>% summarise(N = n()) %>% mutate(PCT = N / sum(N))
## # A tibble: 2 x 3
## SO N PCT
## <fct> <int> <dbl>
## 1 0 519 0.519
## 2 1 481 0.481
bb %>% group_by(SO) %>% summarise(PCT = n() / nrow(.))
## # A tibble: 2 x 2
## SO PCT
## <fct> <dbl>
## 1 0 0.519
## 2 1 0.481
Check whether the probability of winning a sales lead for the product “LearnSys” is more than that of “Finsys.”
30.10 \(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
# #Data | Subset | Filter | Update Factor levels
ii <- bb %>% select(PDT, SO, SVM) %>%
filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Count
table(ii$PDT, ii$SO)
##
## 0 1
## LearnSys 55 71
## Finsys 83 34
#
# #Proportion Table
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##
## 0 1
## LearnSys 0.437 0.563
## Finsys 0.709 0.291
#
prop.test(x = c(71, 34), n = c(126, 117), alternative = "greater")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(71, 34) out of c(126, 117)
## X-squared = 17.316, df = 1, p-value = 1.583e-05
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.1644089 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.5634921 0.2905983
ii <- bb %>% select(PDT, SO, SVM) %>%
filter(PDT %in% c("LearnSys", "Finsys")) %>% mutate(across(PDT, factor))
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
#
# #Proportion Table: margin gives the margin to split by, i.e.
# #1 means row-wise proportions, 2 means column-wise
round(prop.table(table(ii$PDT, ii$SO), margin = 1), 3)
##
## 0 1
## LearnSys 0.437 0.563
## Finsys 0.709 0.291
round(prop.table(table(ii$PDT, ii$SO), margin = 2), 3)
##
## 0 1
## LearnSys 0.399 0.676
## Finsys 0.601 0.324
#
# #Similar
ii %>% select(PDT, SO) %>%
count(PDT, SO) %>% pivot_wider(names_from = SO, values_from = n) %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 2 x 4
## PDT `0` `1` SUM
## <ord> <dbl> <dbl> <dbl>
## 1 LearnSys 43.7 56.3 100
## 2 Finsys 70.9 29.1 100
Check whether the average sales value of “learnsys” projects is higher than that of “Finsys” projects. (\({\alpha} = 0.05\))
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
“ForLater”
str(ii)
## tibble [243 x 3] (S3: tbl_df/tbl/data.frame)
## $ PDT: Ord.factor w/ 2 levels "LearnSys"<"Finsys": 1 2 2 2 1 2 2 1 2 2 ...
## $ SO : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
## $ SVM: num [1:243] 10.29 5.25 7.3 9.69 6.82 ...
ii_var <- var.test(SVM ~ PDT, data = ii)
if(ii_var$p.value > 0.05) {
cat(paste0("Variances are the same. Pooled Test can be applied. \n"))
t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = TRUE)
} else {
cat(paste0("Variances are NOT the same. Welch Test should be applied. \n"))
t.test(formula = SVM ~ PDT, data = ii, alternative = "greater", var.equal = FALSE)
}
## Variances are the same. Pooled Test can be applied.
##
## Two Sample t-test
##
## data: SVM by PDT
## t = 0.93503, df = 241, p-value = 0.1754
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
## -0.1728648 Inf
## sample estimates:
## mean in group LearnSys mean in group Finsys
## 8.030476 7.804786
jj <- ii %>% filter(SO == "1")
jj_var <- var.test(SVM ~ PDT, data = jj)
if(jj_var$p.value > 0.05) {
t.test(formula = SVM ~ PDT, data = jj, alternative = "greater", var.equal = TRUE)
} else {
cat(paste0("Problem: Variances are NOT the same, so the Pooled Test cannot be applied; use the Welch Test instead.\n"))
}
##
## Two Sample t-test
##
## data: SVM by PDT
## t = 0.62152, df = 103, p-value = 0.2678
## alternative hypothesis: true difference in means between group LearnSys and group Finsys is greater than 0
## 95 percent confidence interval:
## -0.3949429 Inf
## sample estimates:
## mean in group LearnSys mean in group Finsys
## 7.947887 7.711471
Check whether there is a difference in the average profit across the geographical locations: United Kingdom, India and the Americas.
33.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
pf(q = 1.5, df1 = 2, df2 = 689, lower.tail = FALSE)
ii <- bb %>% filter(RG %in% c("UK", "India", "Americas")) %>% select(RG, PP, SO)
str(ii)
## tibble [692 x 3] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ PP: num [1:692] 46 70 46 32 50 74 46 57 60 58 ...
## $ SO: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 1 1 1 1 ...
#
# #ANOVA
ii_aov <- aov(formula = PP ~ RG, data = ii)
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 50.7052
##
## RG
## UK Americas India
## 50.6 50.33 53.51
## rep 553.0 104.00 35.00
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RG 2 297 148.7 1.505 0.223
## Residuals 689 68075 98.8
jj <- ii %>% filter(SO == "1")
str(jj)
## tibble [339 x 3] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 1 1 1 3 1 1 1 1 1 1 ...
## $ PP: num [1:339] 46 46 32 42 40 49 42 50 38 56 ...
## $ SO: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
#
# #ANOVA
jj_aov <- aov(formula = PP ~ RG, data = jj)
#
# #
model.tables(jj_aov, type = "means")
## Tables of means
## Grand mean
##
## 46.37168
##
## RG
## UK Americas India
## 46.23 46.36 48.65
## rep 267.00 55.00 17.00
#
# #Summary
summary(jj_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RG 2 93 46.75 0.532 0.588
## Residuals 336 29508 87.82
Check whether the sales conversions are different for different geographical locations.
ii <- bb %>% select(RG, SO)
str(ii)
## tibble [1,000 x 2] (S3: tbl_df/tbl/data.frame)
## $ RG: Ord.factor w/ 9 levels "UK"<"Other Europe"<..: 4 1 1 1 1 2 1 1 1 1 ...
## $ SO: Factor w/ 2 levels "0","1": 1 2 1 2 2 1 1 1 1 1 ...
#
round(prop.table(table(ii$RG, ii$SO), margin = 1), 3)
##
## 0 1
## UK 0.517 0.483
## Other Europe 0.582 0.418
## Americas 0.471 0.529
## Africa 0.409 0.591
## India 0.514 0.486
## Japan 0.375 0.625
## Singapore 0.739 0.261
## Spain 0.917 0.083
## Canada 0.333 0.667
#
# #Chi-Sq Test
tryCatch(chisq.test(table(ii$RG, ii$SO)),
warning = function(w) {
print(paste0(w))
suppressWarnings(chisq.test(table(ii$RG, ii$SO)))
})
## [1] "simpleWarning in chisq.test(table(ii$RG, ii$SO)): Chi-squared approximation may be incorrect\n"
##
## Pearson's Chi-squared test
##
## data: table(ii$RG, ii$SO)
## X-squared = 22.263, df = 8, p-value = 0.004452
Check whether the sales conversions depend on the sales value. Check this claim by making 3 groups of Sales Value: <6 million, [6, 8] million, >8 million dollars.
Both are categorical variables, so a Chi-Square test is required.
\(P_{\chi^2} > {\alpha} \to {H_0}\) cannot be rejected.
ii <- bb %>% select(SO, SVM)
#
# #Create 3 Groups with middle group inclusive of both 6 & 8
ii$RSVM <- cut(ii$SVM, breaks = c(0, 5.9999, 8, 15), labels = 1:3)
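Because cut() builds right-closed intervals by default, the breaks above produce the groups (0, 5.9999], (5.9999, 8], (8, 15], so both 6 and 8 land in the middle group; a quick self-contained check with a few boundary values:

```r
# #Values around the boundaries: 5.5 -> 1, 6 and 8 -> 2, 8.01 -> 3
cut(c(5.5, 6, 8, 8.01), breaks = c(0, 5.9999, 8, 15), labels = 1:3)
## [1] 1 2 2 3
## Levels: 1 2 3
```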
#
summary(ii)
## SO SVM RSVM
## 0:519 Min. : 1.640 1:155
## 1:481 1st Qu.: 6.690 2:346
## Median : 8.000 3:499
## Mean : 8.044
## 3rd Qu.: 9.440
## Max. :14.230
#
table(ii$RSVM, ii$SO)
##
## 0 1
## 1 86 69
## 2 182 164
## 3 251 248
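The chi-squared statistic that chisq.test() reports below can be reproduced by hand from this table: expected counts under independence are (row total × column total) / grand total, and the statistic is the sum of (O − E)²/E (no Yates correction is applied when df > 1). A sketch using the counts printed above:

```r
# #Observed counts from the table above
O <- matrix(c(86, 182, 251, 69, 164, 248), nrow = 3,
            dimnames = list(1:3, c("0", "1")))
# #Expected counts under independence: outer product of margins / total
E <- outer(rowSums(O), colSums(O)) / sum(O)
X2 <- sum((O - E)^2 / E)
round(X2, 3)
## [1] 1.377
# #p-value matches chisq.test: X-squared = 1.377, df = 2, p-value = 0.5023
pchisq(X2, df = (nrow(O) - 1) * (ncol(O) - 1), lower.tail = FALSE)
```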
#
# #Chi-Sq Test
chisq.test(table(ii$RSVM, ii$SO))
##
## Pearson's Chi-squared test
##
## data: table(ii$RSVM, ii$SO)
## X-squared = 1.377, df = 2, p-value = 0.5023
Please import the Jayalaxmi data.
# #Object Names for each sheet
namesJ <- c("xxJdata", "xxJbela", "xxJdhar", "xxJdiseases")
# #Dimensions of these datasets
str(lapply(namesJ, function(x) {dim(eval(parse(text = x)))}))
## List of 4
## $ : int [1:2] 123 26
## $ : int [1:2] 24 14
## $ : int [1:2] 22 14
## $ : int [1:2] 6 4
Figure 7.1 JAT: Usage and Users over the Time Period
bb <- xxJdata %>% rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>% select(Dates, Week, Usage, Users)
#
ii <- bb %>%
mutate(NewDate = case_when(Week == "Week1" ~ Dates, Week == "Week2" ~ Dates + 7,
Week == "Week3" ~ Dates + 14, Week == "Week4" ~ Dates + 21)) %>%
mutate(DIFF = c(1, diff(NewDate)),
TF = ifelse(DIFF > 10, FALSE, TRUE),
grp = cumsum(DIFF > 10) + 1)
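The grp column uses a common gap-grouping idiom: cumsum(DIFF > 10) increments by one at every row whose gap from the previous row exceeds 10 days, so each run of consecutive weeks between large gaps gets its own id. A minimal sketch with hypothetical day numbers:

```r
x <- c(1, 8, 15, 40, 47, 90)   # hypothetical day numbers with gaps after 15 and 47
d <- c(1, diff(x))             # first difference set to 1, as in the code above
cumsum(d > 10) + 1             # a new group id starts at each gap > 10
## [1] 1 1 1 2 2 3
```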
#
str(ii)
## tibble [123 x 8] (S3: tbl_df/tbl/data.frame)
## $ Dates : Date[1:123], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage : num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
## $ Users : num [1:123] 2 1 1 4 6 12 13 10 7 12 ...
## $ NewDate: Date[1:123], format: "2015-06-22" "2015-07-01" "2015-07-08" ...
## $ DIFF : num [1:123] 1 9 7 7 7 10 7 7 7 10 ...
## $ TF : logi [1:123] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ grp : num [1:123] 1 1 1 1 1 1 1 1 1 1 ...
# #Plot
hh <- ii
ttl_hh <- "Usage Pattern"
cap_hh <- "B15P01"
#
B15 <- hh %>% {
ggplot(data = ., aes(x = NewDate, y = Usage)) +
geom_point() +
#geom_smooth(aes(group = grp)) +
geom_line(aes(group = grp)) +
scale_x_date(date_breaks = "3 months",
labels = function(x) if_else(is.na(lag(x)) | !year(lag(x)) == year(x),
paste(month(x, label = TRUE), "\n", year(x)),
paste(month(x, label = TRUE)))) +
labs(x = "Time", y = "Usage",
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
bb <- xxJdata %>% rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>% select(Dates, Week, Usage, Users)
#
ii <- bb %>%
mutate(NewDate = case_when(Week == "Week1" ~ Dates, Week == "Week2" ~ Dates + 7,
Week == "Week3" ~ Dates + 14, Week == "Week4" ~ Dates + 21)) %>%
mutate(DIFF = c(1, diff(NewDate)),
TF = ifelse(DIFF > 10, FALSE, TRUE),
grp = cumsum(DIFF > 10) + 1) %>%
mutate(grp_step = rle(TF)$lengths %>% {rep(seq(length(.)), .)})
#
jj <- c(1, which(ii$DIFF > 10), nrow(ii) + 1)
kk <- rep(1:length(diff(jj)), diff(jj))
stopifnot(identical(kk, as.integer(ii$grp)))
#
# #Where are the missing values
ii %>% filter( !TF | lead(!TF) | lag(!TF) )
## # A tibble: 3 x 9
## Dates Week Usage Users NewDate DIFF TF grp grp_step
## <date> <chr> <dbl> <dbl> <date> <dbl> <lgl> <dbl> <int>
## 1 2017-05-01 Week4 80 27 2017-05-22 7 TRUE 1 1
## 2 2017-09-01 Week4 1923 459 2017-09-22 123 FALSE 2 2
## 3 2017-10-01 Week1 1575 360 2017-10-01 9 TRUE 2 3
Test the claim that disease 6 (leaf curl) information was accessed at least 60 times every month on average since October 2017 due to this disease outbreak. \(({\alpha} = 0.05)\)
NOTE: Actually the claim is “at least 60 times every week.” Month is a printing error.
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
# #Data | Rename | Change from Date Time to Date
bb <- xxJdata %>%
rename(Dates = "Month-Year") %>%
#group_by(Dates) %>%
#summarise(D6 = sum(D6)) %>%
mutate(across(Dates, as_date))
#
# #Get relevant rows using filter() #xxJdata[95:123, ]
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 2.341004
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.01329037
#
# #One Sample t-Test
# #Note: conf.level = 0.05 yields the 5 percent interval printed below;
# #for a conventional 95% interval use conf.level = 0.95 (the p-value is unaffected)
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
##
## One Sample t-test
##
## data: ii$D6
## t = 2.341, df = 28, p-value = 0.01329
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
## 74.52782 Inf
## sample estimates:
## mean of x
## 68.41379
# #Data | Rename | Sum Months D6 | Change from Date Time to Date
bb <- xxJdata %>%
rename(Dates = "Month-Year") %>%
group_by(Dates) %>%
summarise(D6 = sum(D6)) %>%
mutate(across(Dates, as_date))
#
# #There are missing months, but those months are not applicable in this question
# #Get relevant rows using filter()
ii <- bb %>% filter(Dates >= "2017-10-01")
#
t_ii <- {mean(ii$D6) - 60} / {sd(ii$D6) / sqrt(nrow(ii))}
print(t_ii)
## [1] 5.377075
pt(t_ii, df = nrow(ii) - 1, lower.tail = FALSE)
## [1] 0.000516808
#
# #One Sample t-Test
# #Note: conf.level = 0.05 yields the 5 percent interval printed below;
# #for a conventional 95% interval use conf.level = 0.95 (the p-value is unaffected)
t.test(ii$D6, mu = 60, alternative = "greater", conf.level = 0.05)
##
## One Sample t-test
##
## data: ii$D6
## t = 5.3771, df = 7, p-value = 0.0005168
## alternative hypothesis: true mean is greater than 60
## 5 percent confidence interval:
## 314.2406 Inf
## sample estimates:
## mean of x
## 248
str(bb)
## tibble [33 x 2] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:33], format: "2015-06-01" "2015-07-01" "2015-08-01" ...
## $ D6 : num [1:33] 0 6 31 41 74 104 91 88 74 88 ...
summary(bb)
## Dates D6
## Min. :2015-06-01 Min. : 0.0
## 1st Qu.:2016-02-01 1st Qu.: 71.0
## Median :2016-10-01 Median : 91.0
## Mean :2016-10-25 Mean :130.2
## 3rd Qu.:2017-09-01 3rd Qu.:198.0
## Max. :2018-05-01 Max. :365.0
#
# #Assuming each row is one month with no duplicates
stopifnot(identical(anyDuplicated(bb$Dates), 0L))
#
# #Create Sequence of Months
#ii <- seq(ymd("2015-6-1"), ymd("2018-5-1"), by = "months")
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
#
diff_len <- nrow(ii) - nrow(bb)
#
if(!identical(diff_len, 0L)) {
cat(paste0("Number of missing months = ", diff_len,"\n"))
#
# #Find Values that should be in Complete Sequence but are missing in the data
as_date(setdiff(ii$Dates, bb$Dates))
# #OR
ii %>% anti_join(bb)
#
# #This does not need a separate Vector of all Months
# #Get Months Difference using Integer Division and
# #Filter Rows which are not consecutive and rows above them
bb %>%
mutate(diff_months = (interval(lag(Dates), Dates)) %/% months(1)) %>%
filter( (diff_months != 1) | lead(diff_months != 1) )
}
## Number of missing months = 3
## # A tibble: 2 x 3
## Dates D6 diff_months
## <date> <dbl> <dbl>
## 1 2017-05-01 51 1
## 2 2017-09-01 111 4
# #Fill Missing Months
jj <- as_tibble(merge(bb, ii, by = "Dates", all = TRUE))
kk <- right_join(bb, ii, by = "Dates") %>% arrange(Dates)
stopifnot(identical(jj, kk))
# #Replace NA
ll <- kk %>% mutate(across(D6, coalesce, 0))
Test the claim that among the app users for disease information, at least 15% of them access disease information related to disease 6. \(({\alpha} = 0.05)\)
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Determine whether the proportion of farmers searching for D6 is more than \(p_0 = 0.15\)
29.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
# #Data | Sum Disease, Variety, Micronutrients
aa <- xxJdata %>%
mutate(sumD = rowSums(across(starts_with("D"))),
sumV = rowSums(across(starts_with("V")))) %>%
rename(Dates = "Month-Year", Users = "No of users", Micro = "Micronutrient") %>%
mutate(SUM = rowSums(across(c(sumD, sumV, Micro))),
DIFF = Usage - SUM) %>%
select(Dates, Users, Usage, SUM, DIFF, sumD, sumV, Micro, D6) %>%
mutate(across(Dates, as_date)) %>%
mutate(Fraction = D6/sumD)
#
# #Confirmed that Usage is Sum Total of Disease, Variety, Micronutrients
unique(aa$DIFF)
## [1] 0
#
# #Working Set | Exclude 1 NA i.e. where sumD is zero
bb <- aa %>% drop_na(Fraction) %>% select(Usage, sumD, D6, Fraction)
#
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(sumD, D6), sum))
## # A tibble: 1 x 2
## sumD D6
## <dbl> <dbl>
## 1 26830 4295
#
# #One Sample Proportion Test with continuity correction
bb_prop <- prop.test(x = sum(bb$D6), n = sum(bb$sumD), p = 0.15,
alternative = "greater", conf.level = 0.95)
bb_prop
##
## 1-sample proportions test with continuity correction
##
## data: sum(bb$D6) out of sum(bb$sumD), null probability 0.15
## X-squared = 21.311, df = 1, p-value = 1.953e-06
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
## 0.1564156 1.0000000
## sample estimates:
## p
## 0.160082
# #Impact if we try to evaluate the proportion of D6 searches out of ALL Usage (Disease, Variety, Micro)
# #Check n (Sample Count) and x (Count of Success)
bb %>% summarise(across(c(Usage, D6), sum))
## # A tibble: 1 x 2
## Usage D6
## <dbl> <dbl>
## 1 71646 4295
#
# #One Sample Proportion Test with continuity correction
# #With p-value = 1, we cannot claim that at least 15% of searches are for D6 out of ALL Usage
prop.test(x = sum(bb$D6), n = sum(bb$Usage), p = 0.15,
alternative = "greater", conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: sum(bb$D6) out of sum(bb$Usage), null probability 0.15
## X-squared = 4556.2, df = 1, p-value = 1
## alternative hypothesis: true p is greater than 0.15
## 95 percent confidence interval:
## 0.05849838 1.00000000
## sample estimates:
## p
## 0.05994752
# #One Sample t-Test (Wrong)
if(FALSE) {
t.test(bb$Fraction, mu = 0.15, alternative = "greater", conf.level = 0.05)
}
Test the claim that the average number of users in year 2017-2018 is more than the average number of users in year 2015-2016. \(({\alpha} = 0.05)\)
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
# #Data
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(across(Dates, as_date)) %>%
select(Dates, Users, Usage)
#
# #Missing Months
ii <- tibble(Dates = seq(min(bb$Dates), max(bb$Dates), by = "months"))
jj <- right_join(bb, ii, by = "Dates") %>% arrange(Dates) %>% mutate(across(Users, coalesce, 0))
#
# #Create 2 Groups
jj$Year <- cut(jj$Dates, breaks = c(min(ii$Dates), as_date("2017-01-01"), Inf),
labels = c("2015-2016", "2017-2018"))
#
# #Verify Changes
jj[!duplicated(jj$Year), ]
## # A tibble: 2 x 4
## Dates Users Usage Year
## <date> <dbl> <dbl> <fct>
## 1 2015-06-01 2 4 2015-2016
## 2 2017-01-01 92 495 2017-2018
jj %>% filter(Dates %in% ymd(c("2016-12-01", "2017-01-01")))
## # A tibble: 8 x 4
## Dates Users Usage Year
## <date> <dbl> <dbl> <fct>
## 1 2016-12-01 50 536 2015-2016
## 2 2016-12-01 54 318 2015-2016
## 3 2016-12-01 99 558 2015-2016
## 4 2016-12-01 104 573 2015-2016
## 5 2017-01-01 92 495 2017-2018
## 6 2017-01-01 130 578 2017-2018
## 7 2017-01-01 87 436 2017-2018
## 8 2017-01-01 60 261 2017-2018
#
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Users ~ Year, data = jj)
jj_var
##
## F test to compare two variances
##
## data: Users by Year
## F = 0.10358, num df = 72, denom df = 52, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.06158444 0.17055359
## sample estimates:
## ratio of variances
## 0.1035754
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
jj_t <- t.test(formula = Users ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
jj_t
##
## Welch Two Sample t-test
##
## data: Users by Year
## t = -7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
## -Inf -92.96694
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018
## 50.06849 170.84906
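The check-the-variances-then-choose-the-test pattern used here (and in the earlier two-sample comparisons) can be collected into one small helper; a sketch, where run_t2 is a hypothetical name not used elsewhere in this document:

```r
# #Sketch: Pooled t-test if var.test() finds no significant difference in
# #variances at level alpha, otherwise Welch's t-test
run_t2 <- function(fml, data, alternative = "two.sided", alpha = 0.05) {
  eqv <- var.test(fml, data = data)$p.value > alpha
  t.test(fml, data = data, alternative = alternative, var.equal = eqv)
}
```

For example, run_t2(Users ~ Year, data = jj, alternative = "less") would reproduce the Welch result above.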
#
# #Alternatively, we can reverse Factor levels to perform Upper Tail Test
kk <- jj
kk$Year <- factor(kk$Year, levels = rev(levels(jj$Year)))
#
t.test(formula = Users ~ Year, data = kk, alternative = "greater", var.equal = isVarEqual)
##
## Welch Two Sample t-test
##
## data: Users by Year
## t = 7.255, df = 59.87, p-value = 4.641e-10
## alternative hypothesis: true difference in means between group 2017-2018 and group 2015-2016 is greater than 0
## 95 percent confidence interval:
## 92.96694 Inf
## sample estimates:
## mean in group 2017-2018 mean in group 2015-2016
## 170.84906 50.06849
# #Data
str(jj)
## tibble [126 x 4] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:126], format: "2015-06-01" "2015-07-01" "2015-07-01" ...
## $ Users: num [1:126] 2 1 1 4 6 12 13 10 7 12 ...
## $ Usage: num [1:126] 4 1 25 70 100 291 225 141 148 215 ...
## $ Year : Factor w/ 2 levels "2015-2016","2017-2018": 1 1 1 1 1 1 1 1 1 1 ...
# #For Two Sample t-test, check if Variances are equal
jj_var <- var.test(Usage ~ Year, data = jj)
jj_var
##
## F test to compare two variances
##
## data: Usage by Year
## F = 0.8706, num df = 72, denom df = 49, p-value = 0.5853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5117114 1.4434337
## sample estimates:
## ratio of variances
## 0.8705962
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(jj_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "2015-2016", 2 is "2017-2018", we need to perform Lower Tail Test
t.test(formula = Usage ~ Year, data = jj, alternative = "less", var.equal = isVarEqual)
##
## Two Sample t-test
##
## data: Usage by Year
## t = -4.7369, df = 121, p-value = 2.981e-06
## alternative hypothesis: true difference in means between group 2015-2016 and group 2017-2018 is less than 0
## 95 percent confidence interval:
## -Inf -252.37
## sample estimates:
## mean in group 2015-2016 mean in group 2017-2018
## 424.6849 812.9000
Check whether app usage is the same or different across the four weeks of a month. Test the claim that app usage picked up after January 2016. (Answered with Q3)
NOTE: The question is ‘check whether app usage is same or different across the four weeks of a month, using Jan-2016 - May-2018 data.’ However, as seen in Figure 7.1, this time period has 3 months of missing data and a completely different usage pattern after that, so I believe testing only this data would give biased results. Therefore, this was not done.
33.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
Question: When ANOVA is done on transformed data and a conclusion is reached, does this imply that the original data would also lead to the same conclusion?
- Look at the p-value of the transformed data for accepting or rejecting the hypothesis, but look at the mean of the original data to apply the conclusion.
- The ANOVA p-value is NOT trustworthy if the data is NOT normal.
Question: When running any test, should we check whether the data is normal? - Yes.
Figure 7.2 JAT: QQ Plot of Usage and Sqrt(Usage)
# #Data | Missing Months can be ignored because those are missing across all weeks
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
select(Week, Usage)
#
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
## Week Usage
## Length:123 Min. : 1.0
## Class :character 1st Qu.: 286.5
## Mode :character Median : 450.0
## Mean : 582.5
## 3rd Qu.: 749.5
## Max. :3462.0
#
# #ANOVA (on original data : neither normal, nor of equal variance)
bb_aov <- aov(formula = Usage ~ Week, data = bb)
#
# #
model.tables(bb_aov, type = "means")
## Tables of means
## Grand mean
##
## 582.4959
##
## Week
## Week1 Week2 Week3 Week4
## 551.9 522.2 480 764.7
## rep 31.0 30.0 30 32.0
#
# #Summary
summary(bb_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Week 3 1515178 505059 2.22 0.0894 .
## Residuals 119 27074553 227517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
bb_aov
## Call:
## aov(formula = Usage ~ Week, data = bb)
##
## Terms:
## Week Residuals
## Sum of Squares 1515178 27074553
## Deg. of Freedom 3 119
##
## Residual standard error: 476.9877
## Estimated effects may be unbalanced
#
kruskal.test(Usage ~ Week, data = bb)
##
## Kruskal-Wallis rank sum test
##
## data: Usage by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
#
# #Poisson Test (ForLater)
#anova(glm(Usage ~ Week, data = ii, family = poisson), test = "LRT")
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
str(ii)
## tibble [123 x 3] (S3: tbl_df/tbl/data.frame)
## $ Week : Factor w/ 4 levels "Week1","Week2",..: 4 1 2 3 4 1 2 3 4 1 ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
## $ Sqrt : num [1:123] 2 1 5 8.37 10 ...
summary(ii)
## Week Usage Sqrt
## Week1:31 Min. : 1.0 Min. : 1.00
## Week2:30 1st Qu.: 286.5 1st Qu.:16.93
## Week3:30 Median : 450.0 Median :21.21
## Week4:32 Mean : 582.5 Mean :22.35
## 3rd Qu.: 749.5 3rd Qu.:27.38
## Max. :3462.0 Max. :58.84
#
# #ANOVA
ii_aov <- aov(formula = Sqrt ~ Week, data = ii)
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 22.35259
##
## Week
## Week1 Week2 Week3 Week4
## 21.98 21.57 20.74 24.95
## rep 31.00 30.00 30.00 32.00
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Week 3 317 105.75 1.274 0.286
## Residuals 119 9874 82.98
#
kruskal.test(Sqrt ~ Week, data = ii)
##
## Kruskal-Wallis rank sum test
##
## data: Sqrt by Week
## Kruskal-Wallis chi-squared = 3.1614, df = 3, p-value = 0.3674
Statistical tests for comparing the variances of two or more samples. Equality of variances across samples is called homogeneity of variances.
# #Data
str(bb)
## tibble [123 x 2] (S3: tbl_df/tbl/data.frame)
## $ Week : chr [1:123] "Week4" "Week1" "Week2" "Week3" ...
## $ Usage: num [1:123] 4 1 25 70 100 291 225 141 148 215 ...
summary(bb)
## Week Usage
## Length:123 Min. : 1.0
## Class :character 1st Qu.: 286.5
## Mode :character Median : 450.0
## Mean : 582.5
## 3rd Qu.: 749.5
## Max. :3462.0
#
# #Bartlett Test
bartlett.test(Usage ~ Week, data = bb)
##
## Bartlett test of homogeneity of variances
##
## data: Usage by Week
## Bartlett's K-squared = 25.89, df = 3, p-value = 1.006e-05
#
# #Levene Test
ii <- bb %>% mutate(Week = factor(Week))
leveneTest(Usage ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.8691 0.0394 *
## 119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Fligner-Killeen test
fligner.test(Usage ~ Week, data = bb)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Usage by Week
## Fligner-Killeen:med chi-squared = 8.5854, df = 3, p-value = 0.03534
#
# #Transformation: Square Root Data
ii <- bb %>% mutate(Week = factor(Week)) %>% mutate(Sqrt = sqrt(Usage))
bartlett.test(Sqrt ~ Week, data = ii)
##
## Bartlett test of homogeneity of variances
##
## data: Sqrt by Week
## Bartlett's K-squared = 10.656, df = 3, p-value = 0.01374
leveneTest(Sqrt ~ Week, data = ii)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.4529 0.06668 .
## 119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fligner.test(Sqrt ~ Week, data = ii)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: Sqrt by Week
## Fligner-Killeen:med chi-squared = 6.6688, df = 3, p-value = 0.08324
# #Do the data from each of the 4 groups follow a normal distribution
# #Shapiro-Wilk normality test
bb %>% mutate(Week = factor(Week)) %>%
group_by(Week) %>%
summarise(N = n(), Mean = mean(Usage), SD = sd(Usage),
p_Shapiro = shapiro.test(Usage)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 6
## Week N Mean SD p_Shapiro isNormal
## <fct> <int> <dbl> <dbl> <dbl> <lgl>
## 1 Week1 31 552. 384. 0.00631 FALSE
## 2 Week2 30 522. 353. 0.0173 FALSE
## 3 Week3 30 480. 333. 0.000843 FALSE
## 4 Week4 32 765. 714. 0.0000964 FALSE
# #Check Q-Q plot
#qqnorm(bb[bb$Week == "Week1", ]$Usage)
#
# #Transformation: Log (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>%
mutate(Log = log(Usage)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(Log)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.000000533 FALSE
## 2 Week2 0.0314 FALSE
## 3 Week3 0.689 TRUE
## 4 Week4 0.00125 FALSE
#
# #Transformation: Square Root (Success: Passed Normality) - Selected
bb %>% mutate(Week = factor(Week)) %>%
mutate(Sqrt = sqrt(Usage)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(Sqrt)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.493 TRUE
## 2 Week2 0.967 TRUE
## 3 Week3 0.108 TRUE
## 4 Week4 0.732 TRUE
#
# #Transformation: Cube Root (Success: Passed Normality) Just to check
bb %>% mutate(Week = factor(Week)) %>%
mutate(CubeRoot = Usage^(1/3)) %>%
group_by(Week) %>%
summarise(p_Shapiro = shapiro.test(CubeRoot)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.0979 TRUE
## 2 Week2 0.980 TRUE
## 3 Week3 0.359 TRUE
## 4 Week4 0.997 TRUE
#
# #Testing Residuals i.e. Data - Group Mean (Did not pass Normality)
bb %>% mutate(Week = factor(Week)) %>%
group_by(Week) %>%
mutate(Residuals = Usage - mean(Usage)) %>%
summarise(p_Shapiro = shapiro.test(Residuals)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 4 x 3
## Week p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Week1 0.00631 FALSE
## 2 Week2 0.0173 FALSE
## 3 Week3 0.000843 FALSE
## 4 Week4 0.0000964 FALSE
bb <- xxJdata %>%
rename(Dates = "Month-Year", Users = "No of users") %>%
mutate(Week = factor(Week)) %>%
mutate(Sqrt = sqrt(Usage)) %>%
select(Week, Usage, Sqrt)
#
hh <- bb
ttl_hh <- "QQ Plot of Usage"
cap_hh <- "B15P04"
#
B15 <- hh %>% { ggplot(., aes(sample = Usage, colour = Week)) +
stat_qq() +
stat_qq_line() +
labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
#
ttl_hh <- "QQ Plot of Sqrt(Usage)"
cap_hh <- "B15P05"
B15 <- hh %>% { ggplot(., aes(sample = Sqrt, colour = Week)) +
stat_qq() +
stat_qq_line() +
labs(caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
30.2 \(\text{\{Left or Lower\} } {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
A new version of the app was released in August 2016. From which month in the given time frame after the launch of the new version would the mean usage pattern start to show a statistically significant shift?
# #Data
bb <- xxJdata %>% rename(Dates = "Month-Year") %>% mutate(across(Dates, as_date)) %>%
select(Dates, Week, Usage)
#
# #Create 2 Groups
bb$Year <- cut(bb$Dates, breaks = c(min(bb$Dates), as_date("2016-08-01"), Inf),
labels = c("OldApp", "NewApp"))
#
# #For Two Sample t-test, check if Variances are equal
bb_var <- var.test(Usage ~ Year, data = bb)
bb_var
##
## F test to compare two variances
##
## data: Usage by Year
## F = 0.16799, num df = 52, denom df = 69, p-value = 4.775e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1014182 0.2835174
## sample estimates:
## ratio of variances
## 0.1679867
#
# #If Variances are Equal, Pooled Test otherwise Welch Test
isVarEqual <- ifelse(bb_var$p.value > 0.05, TRUE, FALSE)
#
# #Because 1 is "OldApp", 2 is "NewApp", we need to perform Lower Tail Test
bb_t <- t.test(formula = Usage ~ Year, data = bb, alternative = "less", var.equal = isVarEqual)
bb_t
##
## Welch Two Sample t-test
##
## data: Usage by Year
## t = -6.0938, df = 96.698, p-value = 1.124e-08
## alternative hypothesis: true difference in means between group OldApp and group NewApp is less than 0
## 95 percent confidence interval:
## -Inf -317.4977
## sample estimates:
## mean in group OldApp mean in group NewApp
## 334.1132 770.5571
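For reference, the DOF column in the rolling computation below applies the Welch–Satterthwaite approximation for the degrees of freedom of the unequal-variance t-test, with \(s_1, n_1\) taken from the growing NewApp window (SD, ID) and \(s_2, n_2\) from the fixed OldApp baseline (old_sd, old_n):

\(\nu \approx \dfrac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\dfrac{\left( s_1^2/n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2/n_2 \right)^2}{n_2 - 1}}\)

The code takes the floor of \(\nu\), which is the conservative convention.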
# #Rolling Sums
old_sd <- sd(bb[bb$Year == "OldApp", ]$Usage)
old_n <- summary(bb$Year)[1]
#
ii <- bb %>% filter(Year == "NewApp", Dates <= '2017-05-01') %>% select(-Year)
#
jj <- ii %>% mutate(ID = row_number(), cSUM = cumsum(Usage), cMean = cSUM/ID) %>%
mutate(SD = across(Usage, ~ rollapply(., ID, sd, fill = NA, align = "right"))) %>%
mutate(DOF = floor({SD^2 / ID + old_sd^2 / old_n }^2 /
{{SD^2 / ID}^2/{ID-1} + {old_sd^2 / old_n}^2/{old_n-1}})) %>%
mutate(Sigma = sqrt({SD^2 /ID + old_sd^2 /old_n}))
str(jj)
## tibble [40 x 9] (S3: tbl_df/tbl/data.frame)
## $ Dates: Date[1:40], format: "2016-08-01" "2016-08-01" "2016-08-01" ...
## $ Week : chr [1:40] "Week1" "Week2" "Week3" "Week4" ...
## $ Usage: num [1:40] 421 387 264 788 691 256 261 377 295 749 ...
## $ ID : int [1:40] 1 2 3 4 5 6 7 8 9 10 ...
## $ cSUM : num [1:40] 421 808 1072 1860 2551 ...
## $ cMean: num [1:40] 421 404 357 465 510 ...
## $ SD : tibble [40 x 1] (S3: tbl_df/tbl/data.frame)
## ..$ Usage: num [1:40] NA 24 82.6 225.6 220 ...
## $ DOF :'data.frame': 40 obs. of 1 variable:
## ..$ Usage: num [1:40] NA 14 3 3 4 6 7 9 11 13 ...
## $ Sigma:'data.frame': 40 obs. of 1 variable:
## ..$ Usage: num [1:40] NA 34.9 56.6 116.9 103 ...
If a disease is likely to spread in particular weather conditions (data given in the disease index sheet), then access of that disease’s information should be higher in the months having suitable weather conditions. Help the analyst come up with a statistical test to support the claim for the two districts for which a sample of weather and disease access data is provided in the data sheet. Identify the diseases for which you can support this claim. Test this claim both for temperature and relative humidity at 95% confidence.
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \text{ vs. } {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
| p- Var | Equal Var | p- t-test | Upper H0 | |
|---|---|---|---|---|
| D1 ~ iD1 | 0.00002 | FALSE | 0.00084 | Rejected |
| D2 ~ iD2 | 0.00009 | FALSE | 0.00076 | Rejected |
| D3 ~ iD3 | 0.0002 | FALSE | 0.00399 | Rejected |
| D4 ~ iD4 | 0.00029 | FALSE | 0.06659 | Not Rejected |
| D5 ~ iD5 | 0.37691 | TRUE | 0.01814 | Rejected |
| D7 ~ iD7 | 0.90392 | TRUE | 0.00239 | Rejected |
Figure 7.3 JAT: Disease Searches grouped by Favourable and Unfavourable Conditions (T and RH)
# #Merge both dataframes of Two districts
aa <- bind_rows(Belagavi = xxJbela, Dharwad = xxJdhar, .id = 'source') %>%
rename(Dates = Months, RH = "Relative Humidity", TMP = "Temperature") %>%
mutate(across(Dates, as_date)) %>% mutate(source = factor(source)) %>%
select(-c(10:13)) %>% select(-D6)
#
# #Based on Conditional T & RH, get each disease favourable condition = TRUE
q6_bb <- aa %>% mutate(iD1 = ifelse(TMP <= 24 & TMP >= 20 & RH > 80, TRUE, FALSE),
iD2 = ifelse(TMP <= 24.5 & TMP >= 21.5 & RH > 83, TRUE, FALSE),
iD3 = ifelse(TMP <= 24 & TMP >= 22, TRUE, FALSE),
iD4 = ifelse(TMP <= 26 & TMP >= 22 & RH > 85, TRUE, FALSE),
iD5 = ifelse(TMP <= 24.5 & TMP >= 22 & RH <= 85 & RH >= 77, TRUE, FALSE),
iD7 = ifelse(TMP > 25 & RH > 80, TRUE, FALSE)) %>%
mutate(across(starts_with("i"), factor, levels = c(TRUE, FALSE)))
bb <- q6_bb
#
# #Create all Formulae for variance and t-test
formulas <- paste0(names(bb)[3:8], " ~ ", names(bb)[11:16])
#
# #Apply formulae
output <- t(sapply(formulas, function(f) {
test_var <- var.test(as.formula(f), data = bb)
isVarEqual <- ifelse(test_var$p.value > 0.05, TRUE, FALSE)
test_t <- t.test(as.formula(f), data = bb, alternative = "greater", var.equal = isVarEqual)
c("p- Var" = format(round(test_var$p.value, 5), scientific = FALSE),
"Equal Var" = ifelse(test_var$p.value > 0.05, TRUE, FALSE),
"p- t-test" = format(round(test_t$p.value, 5), scientific = FALSE),
"Upper H0" = ifelse(test_t$p.value > 0.05, "Not Rejected", "Rejected"))
}))
bb <- q6_bb
hh <- q6_bb %>%
rename_with(~gsub("iD", "i", .x)) %>%
select(starts_with(c("D", "i"))) %>%
select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
rename(Values = "D", Favourable = "i")
#
ttl_hh <- "BoxPlot of Searches for Diseases in both districts"
cap_hh <- "B15P03"
#
B15 <- hh %>% { ggplot(data = ., mapping = aes(x = Disease, y = Values, fill = Favourable)) +
geom_boxplot(outlier.shape = NA) +
#stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
#scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1),
size = 1, alpha = 0.7, colour = "#21908CFF") +
k_gglayer_box +
theme(
#legend.justification = c("right", "top"),
#legend.box.just = "right",
#legend.margin = margin(6, 6, 6, 6),
legend.position = c(.90, .95)
) +
labs(x = "Diseases", y = "Searches per month", fill = "Favourable",
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B15)
rm(B15)
# #rename_with() uses formula
# #Selection Helpers like starts_with() can take multiple conditions
# #pivot_longer() can return multiple groups
# #Pattern Match: Both First and Second Pattern can contain only 1 character
ii <- q6_bb %>%
rename_with(~gsub("iD", "i", .x)) %>%
select(starts_with(c("D", "i"))) %>% select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.)(.)") %>%
rename(Values = "D", Favourable = "i")
#
# #First Pattern can contain 1 or more characters but the Second can have only 1 character
jj <- q6_bb %>%
select(starts_with(c("D", "i"))) %>% select(-Dates) %>%
pivot_longer(everything(), names_to = c(".value", "Disease"), names_pattern = "(.*)(.)") %>%
rename(Values = "D", Favourable = iD)
stopifnot(identical(ii, jj))
if(FALSE){# #WARNING: Installation may take some time.
install.packages("mice", dependencies = TRUE)
install.packages("car", dependencies = TRUE)
}
Please import the "B16-Cars2.csv".
| sl.No. | mpg | cylinders | cubicinches | hp | weightlbs | time.to.60 | year | brand |
|---|---|---|---|---|---|---|---|---|
| 1 | 14.0 | 8 | 350 | 165 | 4209 | 12 | 1972 | US |
| 2 | 31.9 | 4 | 89 | 71 | 1925 | 14 | 1980 | Europe |
| 3 | 17.0 | 8 | 302 | 140 | 3449 | 11 | 1971 | US |
| 4 | 15.0 | 8 | 400 | 150 | 3761 | 10 | 1971 | US |
| 5 | 30.5 | 4 | 98 | 63 | 2051 | 17 | 1978 | US |
| 6 | 23.0 | 8 | 350 | 125 | 3900 | 17 | 1980 | US |
# #Structure
str(xxB16Cars)
## spec_tbl_df [263 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ sl.No. : num [1:263] 1 2 3 4 5 6 7 8 9 10 ...
## $ mpg : num [1:263] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
## $ cylinders : num [1:263] 8 4 8 8 4 8 8 8 5 4 ...
## $ cubicinches: num [1:263] 350 89 302 400 98 350 351 440 183 89 ...
## $ hp : num [1:263] 165 71 140 150 63 125 158 215 77 62 ...
## $ weightlbs : num [1:263] 4209 1925 3449 3761 2051 ...
## $ time.to.60 : num [1:263] 12 14 11 10 17 17 13 9 20 17 ...
## $ year : num [1:263] 1972 1980 1971 1971 1978 ...
## $ brand : chr [1:263] "US" "Europe" "US" "US" ...
# #Summary
summary(xxB16Cars)
## sl.No. mpg cylinders cubicinches hp weightlbs
## Min. : 1.0 Min. : 10.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. : 192.5
## 1st Qu.: 66.5 1st Qu.: 16.95 1st Qu.:4.000 1st Qu.:103.0 1st Qu.: 75.5 1st Qu.:2245.5
## Median :132.0 Median : 22.00 Median :6.000 Median :156.0 Median : 95.0 Median :2830.0
## Mean :132.0 Mean : 25.07 Mean :5.593 Mean :201.5 Mean :106.3 Mean :2992.9
## 3rd Qu.:197.5 3rd Qu.: 28.90 3rd Qu.:8.000 3rd Qu.:302.0 3rd Qu.:137.5 3rd Qu.:3654.5
## Max. :263.0 Max. :527.00 Max. :8.000 Max. :455.0 Max. :230.0 Max. :4997.0
## time.to.60 year brand
## Min. : 8.00 Min. :1971 Length:263
## 1st Qu.:14.00 1st Qu.:1974 Class :character
## Median :16.00 Median :1977 Mode :character
## Mean :15.54 Mean :1977
## 3rd Qu.:17.00 3rd Qu.:1980
## Max. :25.00 Max. :1983
aa <- xxB16Cars #No missing value
bb <- aa #Will have missing value later
#
# #Identify the Number of Missing Values
if(anyNA(bb)) {
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n"))
} else cat(paste0("NA not found.\n"))
## NA not found.
#
# #Record Some Values, before deleting them
bb_22 <- bb$mpg[2] #bb[2, 2] #31.9
bb_39 <- bb$brand[3] #bb[3, 9] #"US"
bb_43 <- bb$cylinders[4] #bb[4, 3] #8
#
# #Delete
bb$mpg[2] <- bb$brand[3] <- bb$cylinders[4] <- NA
#
# #Identify the Number of Missing Values
cat(paste0("NA are Present! Total NA = ", sum(is.na(bb)), "\n"))
## NA are Present! Total NA = 3
#
# #Which Columns have NA
#summary(bb)
bb_na_col <- colSums(is.na(bb))
#
# #Column Names with their Column Index
which(bb_na_col != 0)
## mpg cylinders brand
## 2 3 9
#
# #Number of NA in each Column
bb_na_col[which(bb_na_col != 0)]
## mpg cylinders brand
## 1 1 1
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 3
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 2 3 4
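As a side note (a sketch on a toy data frame, not the Cars data), `complete.cases()` and `na.omit()` agree on which rows survive, so either can be used to drop rows containing NA:

```r
# #Toy data frame (assumed) with NA in two different columns
dd <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
# #na.omit() keeps exactly the rows flagged TRUE by complete.cases()
stopifnot(identical(na.omit(dd)$x, dd[complete.cases(dd), ]$x))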
ii <- bb
summary(ii$brand)
## Length Class Mode
## 263 character character
summary(factor(ii$brand))
## Europe Japan US NA's
## 48 51 163 1
#
# #table() by default does not show NA even in a factor. However, it has a 'useNA' option
table(ii$brand)
##
## Europe Japan US
## 48 51 163
table(factor(ii$brand))
##
## Europe Japan US
## 48 51 163
table(ii$brand, useNA = "always")
##
## Europe Japan US <NA>
## 48 51 163 1
ii <- bb
# #Mean Replacement
#ii %>% mutate(across(mpg, ~ replace(., is.na(.), round(mean(mpg, na.rm = TRUE), 2))))
ii$mpg[which(is.na(ii$mpg))] <- round(mean(ii$mpg, na.rm = TRUE), digits = 2)
jj <- ii
#
# #Median Replacement
ii <- bb
#ii %>% mutate(across(mpg, replace_na, round(median(mpg, na.rm = TRUE), 2)))
ii$mpg[which(is.na(ii$mpg))] <- round(median(ii$mpg, na.rm = TRUE), digits = 2)
table(bb$brand, useNA = "always")
##
## Europe Japan US <NA>
## 48 51 163 1
bb %>% group_by(brand) %>% summarise(n())
## # A tibble: 4 x 2
## brand `n()`
## <chr> <int>
## 1 Europe 48
## 2 Japan 51
## 3 US 163
## 4 <NA> 1
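The helper `f_getMode()` used below is a user-defined function not shown in this excerpt; a minimal sketch of such a mode helper could look like this:

```r
# #Sketch of a mode helper (the actual f_getMode() is defined elsewhere)
# #Returns the most frequent non-NA value
f_getMode <- function(x) {
  ux <- unique(x[!is.na(x)])
  ux[which.max(tabulate(match(x, ux)))]
}
stopifnot(f_getMode(c("a", "z", "c", "b", "b", "b", "USA", NA, "a")) == "b")
```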
#
# #Mode Replacement
#ii$brand[which(is.na(ii$brand))] <- f_getMode(ii$brand)
ii$brand[which(is.na(ii$brand))] <- "US"
#
# #Caution: Do not use max() on a "character" vector for mode replacement
# #It compares strings by collation order, not by frequency
ii <- c("a", "z", "c", "b", "b", "b", "USA", NA, "a")
max(ii, na.rm = TRUE) #Wrong Value
## [1] "z"
f_getMode(ii)
## [1] "b"
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
#
# #mice::md.pattern()
na_bb <- md.pattern(bb, plot = FALSE)
na_bb
## sl.No. cubicinches hp weightlbs time.to.60 year mpg cylinders brand
## 260 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 0 1 1 1
## 0 0 0 0 0 0 1 1 1 3
Figure 8.1 Cars: Inserted Missing Value Pattern by md.pattern()
# #Choose Two Numbers from 1:10, Randomly
sample(1:10, 2)
## [1] 7 2
sample(1:10, 2)
## [1] 7 4
sample(1:10, 2)
## [1] 3 10
# #All the above calls to generate two random numbers produce different outcomes
# #Using set.seed() we can regenerate the same random numbers every time
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
#
# #If we re-fix the seed, the generator follows the same path and re-generates the numbers
set.seed(3)
sample(1:10, 2)
## [1] 5 7
sample(1:10, 2)
## [1] 4 8
# #Convert to Factor before using MICE
bb$brand <- factor(bb$brand)
# #mice() for imputation
# #Including all relevant data i.e. skipping Serial Number only
impute <- mice(bb[, 2:9], m = 2, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
#
print(impute)
## Class: mids
## Number of multiple imputations: 2
## Imputation methods:
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## "pmm" "pmm" "" "" "" "" "" "polyreg"
## PredictorMatrix:
## mpg cylinders cubicinches hp weightlbs time.to.60 year brand
## mpg 0 1 1 1 1 1 1 1
## cylinders 1 0 1 1 1 1 1 1
## cubicinches 1 1 0 1 1 1 1 1
## hp 1 1 1 0 1 1 1 1
## weightlbs 1 1 1 1 0 1 1 1
## time.to.60 1 1 1 1 1 0 1 1
#
# #Each imputation produces a different set of imputed values
# #e.g. for 'mpg' the values in the two sets are
impute$imp$mpg
## 1 2
## 2 28 14
#
# #NOTE: Original Values that were removed earlier
tibble(mpg = bb_22, brand = bb_39, cylinders = bb_43)
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <chr> <dbl>
## 1 31.9 US 8
#
# #Complete First Set
set1_bb <- complete(impute, 1)
tibble(mpg = set1_bb$mpg[2], brand = set1_bb$brand[3], cylinders = set1_bb$cylinders[4])
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <fct> <dbl>
## 1 28 US 8
#
# #Complete Second Set
set2_bb <- complete(impute, 2)
tibble(mpg = set2_bb$mpg[2], brand = set2_bb$brand[3], cylinders = set2_bb$cylinders[4])
## # A tibble: 1 x 3
## mpg brand cylinders
## <dbl> <fct> <dbl>
## 1 14 US 8
set.seed(3)
ii <- mice(bb[, 2:9], m = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
set.seed(3)
jj <- mice(bb[, 2:9], m = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
#
# #identical() is FALSE but all.equal() is TRUE
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
#
# #Similarly
ii <- mice(bb[, 2:9], m = 3, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
jj <- mice(bb[, 2:9], m = 3, seed = 3)
##
## iter imp variable
## 1 1 mpg cylinders brand
## 1 2 mpg cylinders brand
## 1 3 mpg cylinders brand
## 2 1 mpg cylinders brand
## 2 2 mpg cylinders brand
## 2 3 mpg cylinders brand
## 3 1 mpg cylinders brand
## 3 2 mpg cylinders brand
## 3 3 mpg cylinders brand
## 4 1 mpg cylinders brand
## 4 2 mpg cylinders brand
## 4 3 mpg cylinders brand
## 5 1 mpg cylinders brand
## 5 2 mpg cylinders brand
## 5 3 mpg cylinders brand
identical(ii, jj)
## [1] FALSE
all.equal(ii, jj)
## [1] TRUE
(External) MICE Package Author
# #Using the "character" brand column (not converted to factor) to generate the Warning
ii <- aa
ii$mpg[2] <- ii$brand[3] <- ii$cylinders[4] <- NA
#
tryCatch(expr = {
jj <- mice(ii[, 2:9], m = 1, seed = 3)
}, warning = function(w) {
print(paste0(w))
})
##
## iter imp variable
## 1 1 mpg cylinders
## 2 1 mpg cylinders
## 3 1 mpg cylinders
## 4 1 mpg cylinders
## 5 1 mpg cylinders
## [1] "simpleWarning: Number of logged events: 1\n"
#
# #Warning message: Number of logged events
# #It can occur because of a variety of issues in the data
jj$loggedEvents
## NULL
Refer Outliers: C03 and Outliers: B12
Figure 8.2 Cars: Histogram and Density of Weight (lbs)
# Set up the plot area to visualise 3 plots simultaneously
par(mfrow = c(1, 3))
# Create the histogram bars
hist(aa$weightlbs,
breaks = 30,
xlim = c(0, 5000),
col = "blue",
border = "black",
ylim = c(0, 40),
xlab = "Weight",
ylab = "Counts",
main = "Histogram of Car Weights")
# Make a box around the plot
box(which = "plot",
lty = "solid",
col = "black")
# #Histogram
#bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = aa$weightlbs)
ttl_hh <- "Cars: Histogram of Weight"
cap_hh <- "B16P02"
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(data = tibble(x = mean_hh, y = -Inf,
label = paste0("Mean= ", mean_hh)),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(data = tibble(x = median_hh, y = -Inf,
label = paste0("Median= ", median_hh)),
aes(x = x, y = y, label = label),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh,
")"),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)
# #Density Curve
ttl_hh <- "Cars: Density Plot of Weight"
cap_hh <- "B16P03"
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# #Arrow
arr_y <- 0.0005 #mean(density_hh_tbl$y) #
arr_lst <- list(list("99.7%", xpoints_hh[1], xpoints_hh[7], arr_y),
list("95.4%", xpoints_hh[2], xpoints_hh[6], arr_y),
list("68.3%", xpoints_hh[3], xpoints_hh[5], arr_y))
arr_hh <- arr_lst[[1]]
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
#scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
scale_y_continuous(labels = function(n){format(n, scientific = FALSE)}) +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = arr_hh[[2]], y = arr_hh[[4]],
yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = arr_hh[[3]], y = arr_hh[[4]],
yend = arr_hh[[4]], arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate(geom = "text", x = xpoints_hh[4], y = arr_hh[[4]], label = arr_hh[[1]]) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Density",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)
Figure 8.3 Cars: Scatterplot of Weight (x) vs MPG (y) with and without the two outliers
# Create a Scatterplot
plot(aa$weightlbs,
aa$mpg,
xlim = c(0, 5000),
ylim = c(0, 600),
xlab = "Weight",
ylab = "MPG",
main = "Scatterplot of MPG by Weight",
type = "p", #Points
pch = 16,
col = "blue")
# Add open black circles
points(aa$weightlbs,
aa$mpg,
type = "p",
col = "black")
# #IN: hh$x, hh$y, ttl_hh, cap_hh, x_hh, y_hh
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
B16 <- hh %>% { ggplot(data = ., aes(x = x, y = y)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_vline(aes(xintercept = round(mean(x), 3)), color = '#440154FF', linetype = "dashed") +
geom_hline(aes(yintercept = round(mean(y), 3)), color = '#440154FF', linetype = "dashed") +
geom_text(data = tibble(x = mean(.[["x"]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character")),
aes(x = x, y = y, label = label),
size = 4, color = '#440154FF', hjust = 1.5, vjust = -1, parse = TRUE ) +
geom_text(data = tibble(x = 0, y = mean(.[["y"]]),
label = TeX(r'($\bar{y}$)', output = "character")),
aes(x = x, y = y, label = label),
size = 4, color = '#440154FF', hjust = 1.5, vjust = 1.5, parse = TRUE ) +
geom_point() +
k_gglayer_scatter +
labs(x = x_hh, y = y_hh,
#subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)
Figure 8.4 Cars: BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)
boxplot(mpg ~ cylinders, data = aa, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# #BoxPlot
hh <- aa %>% select(mpg, cylinders) %>% filter(!cylinders %in% c(3, 5)) %>%
filter(mpg < max(mpg)) %>% mutate(across(cylinders, factor))
#
ttl_hh <- "BoxPlot of MPG (excluding 1 point) vs. Cylinders (4, 6, 8)"
cap_hh <- "B16P06"
x_hh <- "Cylinders"
y_hh <- "MPG"
#
B16 <- hh %>% { ggplot(data = ., mapping = aes(x = cylinders, y = mpg, fill = cylinders)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = position_jitterdodge(jitter.width = 0.1), size = 1, alpha = 0.7) +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = x_hh, y = y_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B16)
rm(B16)
bb <- aa
dim(bb)
## [1] 263 9
#
# #summary() or quantile()
summary(bb$weightlbs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192.5 2245.5 2830.0 2992.9 3654.5 4997.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
## 25% 75%
## 2245.5 3654.5
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 1409
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb
low_bb <- q_bb[1] - 1.5 * iqr_bb
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
cat(paste0("No Point was removed because none was outside the range.\n"))
} else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## No Point was removed because none was outside the range.
bb <- aa %>% filter(cylinders == 6)
dim(bb)
## [1] 57 9
#
# #summary() or quantile()
summary(bb$weightlbs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 192.5 2910.0 3121.0 3120.4 3415.0 3907.0
q_bb <- quantile(bb$weightlbs, probs = c(.25, .75), na.rm = TRUE)
q_bb
## 25% 75%
## 2910 3415
#
iqr_bb <- IQR(bb$weightlbs)
iqr_bb
## [1] 505
#
upp_bb <- q_bb[2] + 1.5 * iqr_bb
low_bb <- q_bb[1] - 1.5 * iqr_bb
#
kept_bb <- bb[bb$weightlbs >= low_bb & bb$weightlbs <= upp_bb, ]
if(nrow(bb) == nrow(kept_bb)) {
cat(paste0("No Point was removed because none was outside the range.\n"))
} else cat(paste0("Number of Points removed = ", nrow(bb) - nrow(kept_bb),"\n"))
## Number of Points removed = 1
To be continued …
Please import the "B16-Cars2.csv".
23.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
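A quick check (toy vector, assumed values) that `scale()` with its defaults implements exactly this standardisation:

```r
# #z-score by hand vs scale(); scale() centers by mean() and divides by sd()
x <- c(10, 20, 30, 40)
z_manual <- {x - mean(x)} / sd(x)
stopifnot(all.equal(z_manual, as.vector(scale(x))))
```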
# #Normalising Weight
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
## $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
## $ z : num [1:263] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Excluding Outliers
kept_bb <- bb[bb$z >= -3 & bb$z <= 3, ]
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
## $ weightlbs: num [1:262] 4209 1925 3449 3761 2051 ...
## $ z : num [1:262] 1.402 -1.231 0.526 0.885 -1.086 ...
#
# #Similarly with mpg
kept_bb <- aa %>% select(mpg) %>% mutate(z = as.vector(scale(mpg))) %>% filter(z >= -3 & z <= 3)
str(kept_bb)
## tibble [262 x 2] (S3: tbl_df/tbl/data.frame)
## $ mpg: num [1:262] 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
## $ z : num [1:262] -0.346 0.213 -0.252 -0.314 0.17 ...
summary(kept_bb)
## mpg z
## Min. :10.00 Min. :-0.47040
## 1st Qu.:16.93 1st Qu.:-0.25421
## Median :22.00 Median :-0.09577
## Mean :23.15 Mean :-0.05981
## 3rd Qu.:28.70 3rd Qu.: 0.11340
## Max. :46.60 Max. : 0.67222# #scale(x, center = TRUE, scale = TRUE) output is Nx1 Matrix
bb <- aa %>% select(weightlbs)
ii <- bb %>% mutate(z = as.vector(scale(weightlbs)))
#bb %>% mutate(z = across(weightlbs, scale)) #matrix
#bb %>% mutate(z = across(weightlbs, ~ as.vector(scale(.)))) #tibble
jj <- bb %>% mutate(across(weightlbs, list(z = ~ as.vector(scale(.))), .names = "{.fn}"))
kk <- bb
kk$z <- as.vector(scale(kk$weightlbs))
stopifnot(all(identical(ii, jj), identical(ii, kk)))
# #Min-Max Scaling
min_aa <- min(aa$weightlbs)
max_aa <- max(aa$weightlbs)
bb <- aa %>% select(weightlbs) %>% mutate(z = {weightlbs - min_aa}/{max_aa - min_aa})
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
## $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
## $ z : num [1:263] 0.836 0.361 0.678 0.743 0.387 ...
# #Count Digits in Maximum (NOTE: Take care of NA, 0, [-1, 1] values)
d_bb <- 10^{floor(log10(max(abs(aa$weightlbs)))) + 1}
# #Decimal Scaling
bb <- aa %>% select(weightlbs) %>% mutate(z = weightlbs/d_bb)
str(bb)
## tibble [263 x 2] (S3: tbl_df/tbl/data.frame)
## $ weightlbs: num [1:263] 4209 1925 3449 3761 2051 ...
## $ z : num [1:263] 0.421 0.192 0.345 0.376 0.205 ...
Figure 9.1 Cars: Histogram of Weight (Original vs Scaled)
par(mfrow = c(1,2))
# Create two histograms
hist(bb$weightlbs, breaks = 20,
xlim = c(1000, 5000),
main = "Histogram of Weight",
xlab = "Weight",
ylab = "Counts")
box(which = "plot",
lty = "solid",
col = "black")
#
hist(bb$z,
breaks = 20,
xlim = c(-2, 3),
main = "Histogram of Z-score
of Weight",
xlab = "Z-score of Weight",
ylab = "Counts")
box(which = "plot",
lty = "solid",
col = "black")
# #Skewness
bb <- aa %>% select(weightlbs) %>% mutate(z = as.vector(scale(weightlbs)))
ii <- bb$weightlbs
#
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797
#
ii <- bb$z
3 * {mean(ii) - median(ii)} / sd(ii)
## [1] 0.5632797
bb <- aa %>% select(weightlbs) %>%
mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
Log = log(weightlbs), InvSqr = 1/Sqrt)
#
# #Check Skewness
vapply(bb, function(x) round(3 * {mean(x) - median(x)} / sd(x), 3), numeric(1))
## weightlbs z Sqrt Log InvSqr
## 0.563 0.563 0.339 0.086 0.145
Figure 9.2 Cars: Weight Transformed with Original & Scaled
# #Histogram
bb <- aa %>% select(weightlbs) %>%
mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
Log = log(weightlbs), InvSqr = 1/Sqrt) %>%
pivot_longer(everything(), names_to = "Key", values_to = "Values") %>%
mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"),
labels = c("Square Root", "Natural Log", "Inverse Square",
"Original Weight", "Scaled Weight")))
#
hh <- bb
mean_hh <- hh %>% group_by(Key) %>% summarize(Mean = mean(Values))
#
ttl_hh <- "Cars: Weight with Transformed values and Mean"
cap_hh <- "B17P03"
#
B17 <- hh %>% { ggplot(data = ., mapping = aes(Values)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(data = mean_hh, aes(xintercept = Mean), color = '#440154FF') +
geom_text(data = mean_hh, aes(x = Mean, y = -Inf, label = paste0("Mean= ", f_pNum(Mean))),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
facet_wrap(~Key, scales = 'free_x') +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency", caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)
f_pNum <- function(x, digits = 2L) {
# #Print Numbers
# #round(), rounds to a number of decimal places
# #signif() rounds to a specific number of significant places
# #if(){} else if(){} else{} is NOT vectorised
#ifelse(abs(x) < 0.0000001, 0*sign(x), ifelse(abs(x) > 1, round(x, digits), signif(x, digits + 1L)))
ifelse(abs(x) < 0.0000001, 0*sign(x), floor(x) + signif(x %% 1, digits))
}
# #Histogram with Normal Distribution Overlay
par(mfrow=c(1,1))
hist(bb$InvSqr,
breaks = 30,
xlim=c(0.0125, 0.0275),
col = "lightblue",
prob = TRUE,
border = "black",
xlab="Inverse Square Root of Weight",
ylab = "Counts",
main = "Histogram of Inverse Square Root of Weight")
box(which = "plot",
lty = "solid",
col="black")
# #Overlay kernel density estimate
lines(density(bb$InvSqr), col="red")
A Q–Q (quantile–quantile) plot is a probability plot for comparing two probability distributions by plotting their quantiles against each other. A point \((x, y)\) on the plot corresponds to a quantile of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate). If the two distributions being compared are similar, the points in the Q–Q plot will approximately lie on the line \(y = x\).
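A minimal sketch (simulated data, not the Cars data) of a two-sample Q–Q plot with the reference line \(y = x\):

```r
# #Q-Q plot of two samples drawn from the same distribution
# #Points should lie close to the line y = x
set.seed(3)
ii <- rnorm(200)
jj <- rnorm(200)
qqplot(ii, jj, main = "Q-Q Plot of Two N(0, 1) Samples")
abline(a = 0, b = 1, col = "red") #Reference line y = x
```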
Figure 9.3 Cars: QQ Plots of Transformed Weight
Figure 9.4 Cars: QQ Plots of Transformed Weight
# #QQ Plot
bb <- aa %>% select(weightlbs) %>%
filter(weightlbs > min(weightlbs)) %>%
mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
Log = log(weightlbs), InvSqr = 1/Sqrt) %>%
pivot_longer(everything(), names_to = "Key", values_to = "Values") %>%
mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"),
labels = c("Square Root", "Natural Log", "Inverse Square",
"Original Weight", "Scaled Weight")))
#
hh <- bb
#hh %>% group_by(Key) %>% summarize(Max = max(Values), Min = min(Values))
max_hh <- min_hh <- hh %>% group_by(Key) %>% summarise(Values = 0)
#
# #Modify Number of Y-Axis Major Gridlines for Horizontal Comparison
max_hh$Values <- c(100, 8.55, 0.0300, 5000, 2.35) #c(72, 8.55, 0.0255, 5000, 2.35)
min_hh$Values <- c(20, 7.35, 0.0135, 1500, -1.65) #c(40, 7.35, 0.0135, 1500, -1.65)
#
ttl_hh <- "QQ Plots of Transformed Weight"
sub_hh <- "Excluded 1 Outlier and Modified Y-axis for alignment"
cap_hh <- "B17P04"
#
B17 <- hh %>% { ggplot(., aes(sample = Values)) +
stat_qq() +
stat_qq_line() +
geom_blank(data=max_hh, aes(y = Values)) +
geom_blank(data=min_hh, aes(y = Values)) +
facet_wrap(~Key, scales = 'free') +
scale_x_continuous(limits = c(-3, 3)) +
#coord_flip() +
labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, B17)
rm(B17)
# Normal Q-Q Plot
qqnorm(bb$InvSqr,
datax = TRUE,
col = "red",
ylim = c(0.01, 0.03),
main = "Normal
Q-Q Plot of Inverse Square Root of Weight")
qqline(bb$InvSqr,
col = "blue",
datax = TRUE)
30.5 The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations.
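For larger datasets, one workaround (a sketch, not from the text) is to test a random subsample of at most 5000 observations, since `shapiro.test()` errors out above that size:

```r
# #shapiro.test() accepts sample sizes between 3 and 5000 only
set.seed(3)
ii <- rnorm(100000)
res <- shapiro.test(sample(ii, size = 5000))
res$p.value
```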
set.seed(3)
ii <- rnorm(n = 100, mean = 50, sd = 5.99)
#
# #Check Normality of randomly generated Normal dataset
shapiro.test(ii)
##
## Shapiro-Wilk normality test
##
## data: ii
## W = 0.97928, p-value = 0.1167
#
# #Check Normality of Weight
ii <- aa %>% select(weightlbs) %>%
#filter(weightlbs > min(weightlbs)) %>%
mutate(z = as.vector(scale(weightlbs)), Sqrt = sqrt(weightlbs),
Log = log(weightlbs), InvSqr = 1/Sqrt) %>%
pivot_longer(everything(), names_to = "Key", values_to = "Values") %>%
mutate(across(Key, factor, levels = c("Sqrt", "Log", "InvSqr", "weightlbs", "z"),
labels = c("Square Root", "Natural Log", "Inverse Square",
"Original Weight", "Scaled Weight")))
#
# #No Transformation was able to convert the data to Normality
# #Even after excluding 1 outlier (Not shown here)
ii %>% group_by(Key) %>%
summarise(p_Shapiro = shapiro.test(Values)$p.value,
isNormal = ifelse(p_Shapiro > 0.05, TRUE, FALSE))
## # A tibble: 5 x 3
## Key p_Shapiro isNormal
## <fct> <dbl> <lgl>
## 1 Square Root 2.14e- 7 FALSE
## 2 Natural Log 1.45e-14 FALSE
## 3 Inverse Square 4.03e-25 FALSE
## 4 Original Weight 6.81e- 7 FALSE
## 5 Scaled Weight 6.81e- 7 FALSE
# #Continuous to Categorical (Bins)
cut_ii <- cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE)
levels(cut_ii)
## [1] "[187.7,1794]" "(1794,3396]" "(3396,5002]"
#
# #ggplot2::cut_interval()
cut_jj <- cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE)
levels(cut_jj)
## [1] "[192.5,1794]" "(1794,3396]" "(3396,4997]"
#
# #With Labels: NOTE default ordering is ascending
levels(cut(aa$weightlbs, breaks = 3, dig.lab = 4, include.lowest = TRUE, ordered_result = TRUE,
labels = c("low", "medium", "high")))
## [1] "low" "medium" "high"
levels(cut_interval(aa$weightlbs, n = 3, dig.lab = 4, ordered_result = TRUE,
labels = c("low", "medium", "high")))
## [1] "low" "medium" "high"
bb <- aa %>% select(weightlbs) %>% rename(Weight = 1)
#
# #Subsetting
# #Create Column explicitly to prevent Warning message: Unknown or uninitialised column: `ii`.
bb$ii <- NA
bb$ii[bb$Weight >= 3000] <- 1
bb$ii[bb$Weight < 3000] <- 2
#
# #Using ifelse() or case_when()
bb <- bb %>% mutate(jj = ifelse(Weight >= 3000, 1, 2),
kk = case_when(Weight >= 3000 ~ 1, Weight < 3000 ~ 2))
stopifnot(all(identical(bb$ii, bb$jj), identical(bb$ii, bb$kk)))
# #Create Data
set.seed(3)
bb <- tibble(x = rnorm(n = 10, mean = 5, sd = 0.55),
y = rnorm(n = 10, mean = 4.5, sd = 0.66))
#
# #Basic Indexing
bb$i <- 1:nrow(bb)
# #Index can start from anywhere. However, it is not recommended.
bb$j <- 5:{nrow(bb) + 5L - 1L}
#
# #Other Methods
bb$k <- seq_along(bb[[1]])
bb$l <- seq_len(nrow(bb))
bb$m <- seq.int(nrow(bb))
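A reason to prefer `seq_len()`/`seq_along()` over `1:nrow()`: when the data has zero rows, `1:0` counts downwards instead of producing an empty sequence. A small illustration:

```r
# #1:nrow() silently misbehaves on zero-row data; seq_len() does not
empty_df <- data.frame(x = numeric(0))
1:nrow(empty_df)         # c(1, 0) -- two loop iterations, both wrong
seq_len(nrow(empty_df))  # integer(0) -- zero iterations, as intended
```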
# #Note the placement of the column at the beginning i.e. the column order is modified
bb <- cbind(n = 1:nrow(bb), bb)
bb <- rowid_to_column(bb, "o")
#
bb <- bb %>% mutate(p = row_number())
#
# #Excluding 'j', all other columns are equal. However, 'n' & 'o' modify the column order
stopifnot(all(identical(bb$i, bb$k), identical(bb$i, bb$l),
identical(bb$i, bb$m), identical(bb$i, bb$n), identical(bb$i, bb$o), identical(bb$i, bb$p)))
if(FALSE){# #WARNING: Installation may take some time.
install.packages("esquisse", dependencies = TRUE)
}Please import the "B18-Churn.xlsx".
| Col_Row | Row_1 | Row_2 | Row_3 | Row_4 | Row_5 | Row_6 |
|---|---|---|---|---|---|---|
| State | KS | OH | NJ | OH | OK | AL |
| Account Length | 128 | 107 | 137 | 84 | 75 | 118 |
| Area Code | 415 | 415 | 415 | 408 | 415 | 510 |
| Phone | 382-4657 | 371-7191 | 358-1921 | 375-9999 | 330-6626 | 391-8027 |
| Int’l Plan | no | no | no | yes | yes | yes |
| VMail Plan | yes | yes | no | no | no | no |
| VMail Message | 25 | 26 | 0 | 0 | 0 | 0 |
| Day Mins | 265.1 | 161.6 | 243.4 | 299.4 | 166.7 | 223.4 |
| Day Calls | 110 | 123 | 114 | 71 | 113 | 98 |
| Day Charge | 45.07 | 27.47 | 41.38 | 50.9 | 28.34 | 37.98 |
| Eve Mins | 197.4 | 195.5 | 121.2 | 61.9 | 148.3 | 220.6 |
| Eve Calls | 99 | 103 | 110 | 88 | 122 | 101 |
| Eve Charge | 16.78 | 16.62 | 10.3 | 5.26 | 12.61 | 18.75 |
| Night Mins | 244.7 | 254.4 | 162.6 | 196.9 | 186.9 | 203.9 |
| Night Calls | 91 | 103 | 104 | 89 | 121 | 118 |
| Night Charge | 11.01 | 11.45 | 7.32 | 8.86 | 8.41 | 9.18 |
| Intl Mins | 10 | 13.7 | 12.2 | 6.6 | 10.1 | 6.3 |
| Intl Calls | 3 | 3 | 5 | 7 | 3 | 6 |
| Intl Charge | 2.7 | 3.7 | 3.29 | 1.78 | 2.73 | 1.7 |
| CustServ Calls | 1 | 1 | 0 | 2 | 3 | 0 |
| Churn | False. | False. | False. | False. | False. | False. |
| State | Account Length | Area Code | Phone | Int’l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False. |
| OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False. |
| NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
| OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
| OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
| AL | 118 | 510 | 391-8027 | yes | no | 0 | 223.4 | 98 | 37.98 | 220.6 | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.70 | 0 | False. |
anyNA(bb)
## [1] FALSE
Figure 10.1 Churn: States Frequency
# #Select | Rename
bb <- aa %>% select(`Area Code`, State) %>% rename(Area = "Area Code")
# #Select | Group | Frequency | Descending
ii <- bb %>% select(State) %>% group_by(State) %>% summarise(CNT = n()) %>% arrange(desc(CNT)) %>%
  mutate(across(State, factor, levels = rev(unique(State)), ordered = TRUE))
ii <- bb %>% mutate(across(everything(), factor))
#
# #Unique Values
ii %>% summarise(across(everything(), ~ length(unique(.))))
## # A tibble: 1 x 2
## Area State
## <int> <int>
## 1 3 51
#
summary(ii)
## Area State
## 408: 838 WV : 106
## 415:1655 MN : 84
## 510: 840 NY : 83
## AL : 80
## OH : 78
## OR : 78
## (Other):2824
#
str(levels(ii$Area))
## chr [1:3] "408" "415" "510"
str(levels(ii$State))
## chr [1:51] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" ...
# #Proper Sorting of Factors for Flipped Axes
hh <- ii %>% mutate(nState = as.integer(State))
# #Because CNT has duplicated values, ggplot would add them up if used on the x-axis
anyDuplicated(ii$CNT)
# #So, place it on Y-axis and then flip the axis
#
# #Set Alternate Labels as blanks on both Primary and Secondary x-axis
#x_sec <- x_prim <- as.character(hh$State)
#x_prim[1:nrow(hh) %%2 != 1] <- ""
#x_sec[1:nrow(hh) %%2 == 1] <- ""
#
# #Get Median Location
#hh %>% filter(CNT == median(CNT)) %>% mutate(as.integer(State))
median_loc_hh <- ceiling(nrow(hh)/2)
#
cap_hh <- "B18P01"
ttl_hh <- "Churn: Frequency of States"
sub_hh <- paste0(nrow(hh), " States with Median = ", median(hh$CNT)) #NULL
#
B18 <- hh %>% { ggplot(data = ., aes(x = nState, y = CNT)) +
geom_bar(stat = 'identity', aes(fill = (nState %% 2 == 0))) +
geom_vline(aes(xintercept = median_loc_hh), color = '#440154FF') +
scale_x_continuous( #sec.axis = sec_axis(~., breaks = 1:nrow(.), labels = rev(.$State)),
breaks = 1:nrow(.), guide = guide_axis(n.dodge = 2), labels = rev(.$State)) +
k_gglayer_bar +
coord_flip() +
labs(x = "State", y = "Frequency", subtitle = sub_hh,
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.2 Churn: All Histograms
ii <- bb %>%
select(where(is.numeric)) %>%
select(!area_code) %>%
relocate(ends_with("_mins")) %>%
relocate(ends_with("_calls")) %>%
relocate(vmail_message, .after = last_col()) %>%
pivot_longer(everything(), names_to = "Key", values_to = "Values") %>%
mutate(across(Key, ~ factor(., levels = unique(Key))))
#
str(ii)
# #Histogram
hh <- ii
ttl_hh <- "Churn: Histograms"
cap_hh <- "B18P03"
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Values)) +
geom_histogram(bins = ifelse(length(unique(.[[1]])) > 50, 50, length(unique(.[[1]]))),
alpha = 0.4, fill = '#FDE725FF') +
theme(plot.title.position = "panel",
strip.text.x = element_text(size = 10, colour = "white")) +
facet_wrap(~Key, nrow = 3, scales = 'free') +
labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.3 Churn: All QQ Plots
# #QQ Plots
hh <- ii
ttl_hh <- "Churn: QQ Plots"
cap_hh <- "B18P04"
#
B18 <- hh %>% { ggplot(., aes(sample = Values)) +
stat_qq() +
stat_qq_line() +
facet_wrap(~Key, nrow = 3, scales = 'free_y') +
#scale_x_continuous(limits = c(-3, 3)) +
#coord_flip() +
theme(plot.title.position = "panel",
strip.text.x = element_text(size = 10, colour = "white")) +
labs(x = "x", y = NULL, caption = cap_hh, subtitle = NULL, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.4 Churn: BoxPlots of Calls, Minutes, & Charges
# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = Key, y = Values, fill = Key)) +
geom_boxplot() +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.5 Churn: BoxPlots of International Calls, Minutes, & Charges
# #BoxPlot
B18 <- hh %>% { ggplot(data = ., mapping = aes(y = Values)) +
geom_boxplot() +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.6 Churn: BoxPlots of Remaining Three
Figure 10.7 Churn: International Calls
# #Rename to Proper Names | To Lower, Replace by Underscore | Coercion
bb <- aa %>% rename_with(make.names) %>%
rename_with(~ tolower(gsub(".", "_", .x, fixed = TRUE))) %>%
mutate(across(c(int_l_plan, vmail_plan), ~case_when(. == "yes" ~ TRUE, . == "no" ~ FALSE))) %>%
mutate(across(churn, ~case_when(. == "True." ~ TRUE, . == "False." ~ FALSE))) %>%
mutate(across(ends_with("_calls"), as.integer))
#t(bb %>% summarise(across(everything(), ~length(unique(.)))))
#str(bb)
#summary(bb)
# #Histogram
hh <- tibble(ee = bb$intl_calls)
ttl_hh <- "Churn: Histogram of International Calls"
cap_hh <- "B18P02"
# #Bins
summary(hh[[1]])
bins_hh <- ifelse(length(unique(hh[[1]])) > 50, 50, length(unique(hh[[1]])))
# #Basics
median_hh <- round(median(hh[[1]]), 1)
mean_hh <- round(mean(hh[[1]]), 1)
sd_hh <- round(sd(hh[[1]]), 1)
len_hh <- nrow(hh)
#
B18 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = bins_hh, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(data = tibble(x = mean_hh, y = -Inf,
label = paste0("Mean= ", mean_hh)),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(data = tibble(x = median_hh, y = -Inf,
label = paste0("Median= ", median_hh)),
aes(x = x, y = y, label = label),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh,
")"),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, B18)
rm(B18)
Figure 10.8 Churn: All Histograms (Scaled)
Figure 10.9 Churn: All QQ Plots (Scaled)
Figure 10.10 Churn: All Box Plots (Scaled)
Figure 10.11 Churn: All Histograms
43.1 In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.
43.2 Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.
if(FALSE){# #WARNING: Installation may take some time.
install.packages("rfm", dependencies = TRUE)
install.packages("lubridate", dependencies = TRUE)
}Please import the "B19-Transaction.csv".
# #character to date using dmy()
bb <- aa
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
## $ order_date : chr [1:4906] "20-12-2004" "02-05-2005" "06-03-2004" "15-03-2006" ...
## $ revenue : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
bb$order_date <- dmy(bb$order_date)
#
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
## $ order_date : Date[1:4906], format: "2004-12-20" "2005-05-02" "2004-03-06" ...
## $ revenue : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
anyNA(bb)
## [1] FALSE
summary(bb)
## customer_id order_date revenue
## Length:4906 Min. :2001-10-29 Min. : 10.00
## Class :character 1st Qu.:2004-10-07 1st Qu.: 45.00
## Mode :character Median :2005-06-08 Median : 81.00
## Mean :2005-05-29 Mean : 94.61
## 3rd Qu.:2005-11-08 3rd Qu.:137.00
## Max. :2006-12-30 Max. :219.00
#
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$order_date) + 1 #as_date("2006-12-31")
#
# #RFM analysis by rfm_table_order()
rfm_result <- rfm_table_order(bb, customer_id = customer_id, order_date = order_date,
revenue = revenue, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
loc_src <- paste0(.z$XL, "B19-Transaction-RFM.csv")
# #Save the Result in a CSV
if(FALSE) write_csv(rfm_result$rfm, file = loc_src)
# #Bins of RFM
str(rfm_result$rfm)
## tibble [995 x 9] (S3: tbl_df/tbl/data.frame)
## $ customer_id : chr [1:995] "Abbey O'Reilly DVM" "Add Senger" "Aden Lesch Sr." "Admiral Senger" ...
## $ date_most_recent : Date[1:995], format: "2006-06-09" "2006-08-13" "2006-06-20" ...
## $ recency_days : num [1:995] 205 140 194 132 90 84 281 246 349 619 ...
## $ transaction_count: num [1:995] 6 3 4 5 9 9 8 4 3 4 ...
## $ amount : num [1:995] 472 340 405 448 843 763 699 157 363 196 ...
## $ recency_score : int [1:995] 3 4 3 4 5 5 3 3 2 1 ...
## $ frequency_score : int [1:995] 4 1 2 3 5 5 5 2 1 2 ...
## $ monetary_score : int [1:995] 3 2 3 3 5 5 5 1 2 1 ...
## $ rfm_score : num [1:995] 343 412 323 433 555 555 355 321 212 121 ...
# #Recency: Unlike the other two, its ranking looks reversed i.e. 5 is assigned to the lowest value
# #However, 5 correctly marks the 'Most Recent' customers (fewest days since the last order)
rfm_result$rfm %>%
group_by(recency_score) %>%
summarise(MIN = min(recency_days), MAX = max(recency_days), N = n())
## # A tibble: 5 x 4
## recency_score MIN MAX N
## <int> <dbl> <dbl> <int>
## 1 1 482 976 197
## 2 2 298 481 200
## 3 3 181 297 199
## 4 4 116 180 199
## 5 5 1 114 200
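The 1 to 5 scores above are essentially quintile ranks. A base-R sketch of the idea, where the `score5()` helper and the toy `amount` vector are hypothetical and the rfm package's own tie handling may differ:

```r
# #Quintile scoring: rank the values, then split the ranks into 5 equal groups
score5 <- function(x, reverse = FALSE) {
  r <- rank(if (reverse) -x else x, ties.method = "first")
  as.integer(ceiling(5 * r / length(x)))
}
set.seed(3)
amount <- round(runif(100, min = 10, max = 1500))
monetary_score <- score5(amount)                  # 5 = highest spenders
recency_like   <- score5(amount, reverse = TRUE)  # reversed ranking, as for recency_days
table(monetary_score)                             # 20 customers per score
```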
# #Frequency
rfm_result$rfm %>%
group_by(frequency_score) %>%
summarise(MIN = min(transaction_count), MAX = max(transaction_count), N = n())
## # A tibble: 5 x 4
## frequency_score MIN MAX N
## <int> <dbl> <dbl> <int>
## 1 1 1 3 268
## 2 2 4 4 187
## 3 3 5 5 176
## 4 4 6 7 244
## 5 5 8 14 120
# #Monetary
rfm_result$rfm %>%
group_by(monetary_score) %>%
summarise(MIN = min(amount), MAX = max(amount), N = n())
## # A tibble: 5 x 4
## monetary_score MIN MAX N
## <int> <dbl> <dbl> <int>
## 1 1 12 255 200
## 2 2 258 381 200
## 3 3 382 506 198
## 4 4 507 665 202
## 5 5 668 1488 195
# #Read CSV
jj <- read_csv(loc_src, show_col_types = FALSE) %>%
mutate(across(c(recency_score, frequency_score, monetary_score), as.integer))
ii <- rfm_result$rfm
#
attr(jj, "spec") <- NULL
attr(jj, "problems") <- NULL
# #Verification
all_equal(ii, jj) #TRUE
## [1] TRUE
#
attributes(ii)$class
## [1] "tbl_df" "tbl" "data.frame"
attributes(jj)$class
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
#
# #Modify Class Attribute i.e. Remove 1st "spec_tbl_df"
attr(jj, "class") <- attr(jj, "class")[-1]
#
all.equal(ii, jj) #TRUE
## [1] TRUE
identical(ii, jj) #TRUE
## [1] TRUE
#
# #NOTE Position of Attributes does not matter
names(attributes(ii))
## [1] "names" "row.names" "class"
names(attributes(jj))
## [1] "row.names" "names" "class"# #character to date using dmy()
bb <- aa
str(bb)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
## $ order_date : chr [1:4906] "20-12-2004" "02-05-2005" "06-03-2004" "15-03-2006" ...
## $ revenue : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
#
ii <- bb
ii$order_date <- dmy(ii$order_date)
#
# #Equivalent
jj <- bb %>% mutate(order_date = dmy(order_date))
stopifnot(identical(ii, jj))
#
str(jj)
## spec_tbl_df [4,906 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id: chr [1:4906] "Mr. Brion Stark Sr." "Ethyl Botsford" "Hosteen Jacobi" "Mr. Edw Frami" ...
## $ order_date : Date[1:4906], format: "2004-12-20" "2005-05-02" "2004-03-06" ...
## $ revenue : num [1:4906] 32 36 116 99 76 56 108 183 30 13 ...
anyNA(jj)
## [1] FALSE
summary(jj)
## customer_id order_date revenue
## Length:4906 Min. :2001-10-29 Min. : 10.00
## Class :character 1st Qu.:2004-10-07 1st Qu.: 45.00
## Mode :character Median :2005-06-08 Median : 81.00
## Mean :2005-05-29 Mean : 94.61
## 3rd Qu.:2005-11-08 3rd Qu.:137.00
## Max. :2006-12-30 Max. :219.00
Segment rules might look arbitrary; however, a lot of thought goes into them. This is a tedious task.
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones",
"Could be Promising", "Require Assistance", "Getting Less Frequent",
"Almost Out", "Can not Lose Them", "Do not Show Up at All")
# #Define Rules of Minimum and Maximum RFM for each group
r_low <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high),
              FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
divisions <- rfm_segment(rfm_result, segment_names = segment_titles,
recency_lower = r_low, recency_upper = r_high,
frequency_lower = f_low, frequency_upper = f_high,
monetary_lower = m_low, monetary_upper = m_high)
# #Output is a Tibble
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B19-Transaction-Divisions.csv")
if(FALSE) write_csv(divisions, file = loc_src)
#
# #We defined 10 segments. However, only 7 (+1) of them are represented in the data,
# #and 48 customers were not captured by our classifications; these were assigned to 'Others'
divisions %>%
count(segment) %>%
mutate(PCT = round(100 * n / sum(n), 1)) %>%
rename(SEGMENT = segment, FREQ = n) %>%
arrange(desc(FREQ))
## # A tibble: 8 x 3
## SEGMENT FREQ PCT
## <chr> <int> <dbl>
## 1 Loyal 278 27.9
## 2 Likely to be Loyal 229 23
## 3 First Grade 158 15.9
## 4 Do not Show Up at All 111 11.2
## 5 Almost Out 86 8.6
## 6 Getting Less Frequent 50 5
## 7 Others 48 4.8
## 8 Require Assistance 35 3.5
# #Histograms of Median RFM can be plotted.
# #These are ggplot graphs so can be improved later by manually plotting
if(FALSE) {#Histograms of Median RFM for each Segment
hh <- divisions
rfm_plot_median_recency(hh)
rfm_plot_median_frequency(hh)
rfm_plot_median_monetary(hh)
}
if(FALSE) {
hh <- rfm_result
rfm_histograms(hh) #Histograms of RFM
rfm_order_dist(hh) #Histograms of Customer Orders i.e. Frequency
rfm_heatmap(hh) #Heatmap of Monetary on Axes of Recency and Frequency. Slightly Useful
rfm_bar_chart(hh) #Bar Charts with Facetting of RFM
# #Scatter Plots among Recency, Monetary, Frequency
rfm_rm_plot(hh)
rfm_fm_plot(hh)
rfm_rf_plot(hh)
}Please import the "B19-Customer.csv".
# #character to date using dmy()
bb <- aa <- xxB19Customer
str(bb)
## spec_tbl_df [39,999 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customer_id : num [1:39999] 22086 2290 26377 24650 12883 ...
## $ revenue : num [1:39999] 777 1555 336 1189 1229 ...
## $ most_recent_visit: chr [1:39999] "14-05-2006" "08-09-2006" "19-11-2006" "29-10-2006" ...
## $ number_of_orders : num [1:39999] 9 16 5 12 12 11 17 11 9 10 ...
## $ recency_days : num [1:39999] 232 115 43 64 23 72 112 142 43 131 ...
bb$most_recent_visit <- dmy(bb$most_recent_visit)
#
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$most_recent_visit) + 1 #as_date("2006-12-31")
#
# #RFM analysis by rfm_table_customer()
rfm_customer <- rfm_table_customer(bb, customer_id = customer_id, n_transactions = number_of_orders,
recency_days = recency_days, total_revenue = revenue, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B19-Customer-RFM.csv")
if(FALSE) write_csv(rfm_customer$rfm, file = loc_src)
Please import the "B19-OnlineRetail.csv".
bb <- aa <- xxB19Retail
#
# #NOTE dates are in mmddyyyy format
bb$InvoiceDate[5000:5010]
## [1] "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010"
## [8] "12-02-2010" "12-02-2010" "12-02-2010" "12-02-2010"
bb$InvoiceDate <- mdy(bb$InvoiceDate)
#
# #Which Columns have NA
which(vapply(bb, anyNA, logical(1)))
## Description CustomerID
## 3 7
#
# #Remove NA | Remove Unit Price with 0 | Quantity 0 or Negative i.e. Returns | Dropped Columns |
# #Calculate Revenue
ii <- bb %>%
drop_na(CustomerID) %>%
filter(UnitPrice > 0 & Quantity > 0) %>%
select(-c(1:3, 8)) %>%
mutate(Revenue = UnitPrice * Quantity)
#
summary(ii)
## Quantity InvoiceDate UnitPrice CustomerID Revenue
## Min. : 1.00 Min. :2010-12-01 Min. : 0.001 Min. :12346 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.:2011-04-07 1st Qu.: 1.250 1st Qu.:13969 1st Qu.: 4.68
## Median : 6.00 Median :2011-07-31 Median : 1.950 Median :15159 Median : 11.80
## Mean : 12.99 Mean :2011-07-10 Mean : 3.116 Mean :15294 Mean : 22.40
## 3rd Qu.: 12.00 3rd Qu.:2011-10-20 3rd Qu.: 3.750 3rd Qu.:16795 3rd Qu.: 19.80
## Max. :80995.00 Max. :2011-12-09 Max. :8142.750 Max. :18287 Max. :168469.60
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones",
"Could be Promising", "Require Assistance", "Getting Less Frequent",
"Almost Out", "Can not Lose Them", "Do not Show Up at All")
# #Define Rules of Minimum and Maximum RFM for each group
r_low <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high),
              FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(ii$InvoiceDate) + 1 #as_date("2011-12-10")
rfm_ii <- rfm_table_order(ii, customer_id = CustomerID, order_date = InvoiceDate,
revenue = Revenue, analysis_date = analysis_date)
div_ii <- rfm_segment(rfm_ii, segment_names = segment_titles,
recency_lower = r_low, recency_upper = r_high,
frequency_lower = f_low, frequency_upper = f_high,
monetary_lower = m_low, monetary_upper = m_high)
# #Sorted Count of Segments
div_ii %>%
count(segment) %>%
  mutate(PCT = round(100 * n / sum(n), 1)) %>%
rename(SEGMENT = segment, FREQ = n) %>%
arrange(desc(FREQ))
## # A tibble: 8 x 3
## SEGMENT FREQ PCT
## <chr> <int> <dbl>
## 1 Loyal 1163 26.8
## 2 First Grade 920 21.2
## 3 Likely to be Loyal 741 17.1
## 4 Almost Out 439 10.1
## 5 Do not Show Up at All 404 9.3
## 6 Others 287 6.6
## 7 Getting Less Frequent 214 4.9
## 8 Require Assistance 170 3.9
if(FALSE){# #WARNING: Installation may take some time.
install.packages("factoextra", dependencies = TRUE)
}
44.1 Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
44.2 A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
44.3 Euclidean distance between records is given by equation, \(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_i{\left(x_i - y_i\right)^2}}\), where \(x = \{x_1, x_2, \ldots, x_m\}\) and \(y = \{y_1, y_2, \ldots, y_m\}\) represent the \({m}\) attribute values of two records.
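The Euclidean distance in 44.3 can be checked directly against `dist()`, whose default method is `"euclidean"`:

```r
# #Distance between two records x and y, by the formula and by dist()
x <- c(5, 3, 1)
y <- c(1, 6, 3)
d_manual <- sqrt(sum((x - y)^2))          # sqrt(16 + 9 + 4) = sqrt(29)
d_dist   <- as.numeric(dist(rbind(x, y))) # dist() defaults to Euclidean
stopifnot(isTRUE(all.equal(d_manual, d_dist)))
```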
Please import the "B20-movie.csv".
bb <- aa <- xxB20Movies
# #Drop ID | Scale |
xw <- aa %>% select(-1)
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
## Horror Romcom Action Comedy Fantasy
## Min. : 0.00 Min. : 0.00 Min. : 24.60 Min. : 0.00 Min. : 0.00
## 1st Qu.: 40.00 1st Qu.:19.90 1st Qu.: 58.75 1st Qu.: 38.50 1st Qu.: 28.95
## Median : 62.80 Median :29.70 Median : 70.50 Median : 60.00 Median : 41.20
## Mean : 58.57 Mean :31.25 Mean : 68.84 Mean : 56.52 Mean : 45.61
## 3rd Qu.: 78.25 3rd Qu.:41.65 3rd Qu.: 80.55 3rd Qu.: 73.45 3rd Qu.: 59.85
## Max. :100.00 Max. :81.30 Max. :100.00 Max. :100.00 Max. :100.00
# #Fix Seed
set.seed(3)
# #Cluster analysis with different k = {2, 3, 4}
k2_zw <- kmeans(zw, centers = 2)
k3_zw <- kmeans(zw, centers = 3)
k4_zw <- kmeans(zw, centers = 4)
#
names(k2_zw)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
## [7] "size" "iter" "ifault"
#
# #Two Clusters
ii <- k2_zw
# #within-cluster sum of squares (Preferred lower value i.e. Homogeneity within cluster)
ii$withinss
## [1] 159.6976 605.7610
identical(ii$tot.withinss, sum(ii$withinss))
## [1] TRUE
# #between-cluster sum of squares
ii$betweenss
## [1] 684.5413
# #The total sum of squares
ii$totss
## [1] 1450
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 47.21%"
#
# #Members within Clusters
ii$size
## [1] 73 218
#
# #Matrix of cluster centres
round(ii$centers, 3)
## Horror Romcom Action Comedy Fantasy
## 1 -1.383 1.200 -1.170 0.835 1.267
## 2 0.463 -0.402 0.392 -0.279 -0.424
#
# #Cluster Membership of each point
str(ii$cluster)
## int [1:291] 2 2 2 2 2 2 2 2 2 2 ...
#
# #Save cluster membership of each point back into the dataset
res_movies <- cbind(xw,
                    list(k2 = k2_zw$cluster, k3 = k3_zw$cluster, k4 = k4_zw$cluster)) %>% as_tibble()
# #Three Clusters
ii <- k3_zw
ii$size
## [1] 72 105 114
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 61.81%"
round(ii$centers, 3)
## Horror Romcom Action Comedy Fantasy
## 1 -1.392 1.210 -1.177 0.816 1.296
## 2 0.906 -0.368 0.432 -1.108 -0.036
## 3 0.044 -0.425 0.346 0.505 -0.785
# #Four Clusters
ii <- k4_zw
ii$size
## [1] 73 51 69 98
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 64.73%"
round(ii$centers, 3)
## Horror Romcom Action Comedy Fantasy
## 1 0.068 -0.899 0.362 0.530 -0.817
## 2 0.065 0.355 0.364 0.327 -0.572
## 3 -1.434 1.232 -1.214 0.848 1.325
## 4 0.925 -0.383 0.396 -1.162 -0.027
hh <- zw
cap_hh <- "B20P01"
ttl_hh <- "Movie: Elbow Curve (WSS)"
loc_png <- paste0(.z$PX, "B20P01", "-Movie-Elbow-Wss", ".png")
#
# #factoextra::fviz_nbclust() generates ggplot
# #method = "wss" (for total within sum of square)
B20P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "wss") +
  labs(caption = cap_hh, title = ttl_hh)
Figure 12.1 Movie: Elbow Curve (WSS) in FactoExtra and Base R
Figure 12.2 Movie: Genres with k=3
xxB20Movies <- f_getRDS(xxB20Movies)
bb <- aa <- xxB20Movies
# #Drop ID | Scale |
xw <- aa %>% select(-1)
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
## Horror Romcom Action Comedy Fantasy
## Min. : 0.00 Min. : 0.00 Min. : 24.60 Min. : 0.00 Min. : 0.00
## 1st Qu.: 40.00 1st Qu.:19.90 1st Qu.: 58.75 1st Qu.: 38.50 1st Qu.: 28.95
## Median : 62.80 Median :29.70 Median : 70.50 Median : 60.00 Median : 41.20
## Mean : 58.57 Mean :31.25 Mean : 68.84 Mean : 56.52 Mean : 45.61
## 3rd Qu.: 78.25 3rd Qu.:41.65 3rd Qu.: 80.55 3rd Qu.: 73.45 3rd Qu.: 59.85
## Max. :100.00 Max. :81.30 Max. :100.00 Max. :100.00 Max. :100.00
45.3 The silhouette is a characteristic of each data value. For each data value i,
45.3 \(\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\text{max}(b_i, a_i)} \to s_i \in [-1, 1]\), where \(a_i\) is the distance between the data value (Cohesion) and its cluster center, and \(b_i\) is the distance between the data value and the next closest cluster center (Separation).
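The formula in 45.3 uses distances to cluster centers; `cluster::silhouette()` (the `cluster` package is attached in this session) uses the standard definition with average dissimilarities to clusters, which behaves similarly. A sketch on well-separated toy data:

```r
library(cluster)

# #Average silhouette width for a toy 2-cluster k-means fit
set.seed(3)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
km  <- kmeans(toy, centers = 2)
sil <- silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"])  # near 1 for well-separated clusters
```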
hh <- zw
cap_hh <- "B21P01"
ttl_hh <- "Movie: Elbow Curve (Silhouette)"
#
# #method = "silhouette" (for average silhouette width)
B21P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "silhouette") +
  labs(caption = cap_hh, title = ttl_hh)
Figure 13.1 Movie: Elbow Curve of k (Silhouette)
Please import the "B21-state-crime.csv".
aa <- xxB21Crime
# #Only Year 2019 | Exclude USA Total | Only Rate Variables, NOT Totals | Scale |
xw <- aa %>%
filter(Year == "2019", State != "United States") %>%
select(Data.Population, starts_with("Data.Rates") & !ends_with("All"))
#
# #Rename Columns for ease of use
ii <- names(xw)
ii <- str_replace(ii, pattern = paste0(c("Data.Rates.", "Data."), collapse = "|"), "")
ii <- str_replace_all(ii, c("Violent." = "v_", "Property." = "p_"))
names(xw) <- ii
#
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
dim(xw)
## [1] 51 8
summary(xw)
## Population p_Burglary p_Larceny p_Motor v_Assault
## Min. : 578759 Min. :126.3 Min. : 911.8 Min. : 47.8 Min. : 61.3
## 1st Qu.: 1789606 1st Qu.:243.2 1st Qu.:1193.5 1st Qu.:141.6 1st Qu.:165.9
## Median : 4467673 Median :328.7 Median :1555.7 Median :203.8 Median :246.3
## Mean : 6436069 Mean :345.6 Mean :1580.4 Mean :215.2 Mean :257.2
## 3rd Qu.: 7446805 3rd Qu.:412.4 3rd Qu.:1846.9 3rd Qu.:274.1 3rd Qu.:309.9
## Max. :39512223 Max. :696.8 Max. :3775.4 Max. :427.2 Max. :650.5
## v_Murder v_Rape v_Robbery
## Min. : 1.500 Min. : 17.20 Min. : 8.70
## 1st Qu.: 2.550 1st Qu.: 36.85 1st Qu.: 41.10
## Median : 4.600 Median : 44.60 Median : 63.60
## Mean : 5.145 Mean : 47.66 Mean : 68.89
## 3rd Qu.: 6.400 3rd Qu.: 55.00 3rd Qu.: 80.95
## Max. :23.500 Max. :148.70 Max. :384.40
Figure 13.2 Crime: Elbow Curve of k Silhouette and WSS
# #Cluster analysis with different k = {3, 4}
set.seed(3)
k3_zw <- kmeans(zw, centers = 3)
k4_zw <- kmeans(zw, centers = 4)
# #Save cluster membership of each point back into the dataset
res_crime <- cbind(xw, list(k3 = k3_zw$cluster, k4 = k4_zw$cluster)) %>% as_tibble()
#
# #Three Clusters
ii <- k3_zw
ii$size
## [1] 1 21 29
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 48.54%"
round(ii$centers, 3)
## Population p_Burglary p_Larceny p_Motor v_Assault v_Murder v_Rape v_Robbery
## 1 -0.779 -0.597 4.574 1.185 2.612 4.963 0.063 5.708
## 2 0.176 0.934 0.599 0.864 0.556 0.356 0.291 0.200
## 3 -0.100 -0.656 -0.592 -0.666 -0.493 -0.429 -0.213 -0.342
#
# #Four Clusters
ii <- k4_zw
ii$size
## [1] 15 13 11 12
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 48.14%"
round(ii$centers, 3)
## Population p_Burglary p_Larceny p_Motor v_Assault v_Murder v_Rape v_Robbery
## 1 0.733 0.274 0.255 0.377 -0.203 0.073 -0.273 0.287
## 2 -0.484 -0.303 -0.301 -0.184 -0.237 -0.526 0.222 -0.573
## 3 -0.365 1.240 1.118 1.026 1.464 1.200 0.773 0.775
## 4 -0.058 -1.151 -1.017 -1.213 -0.831 -0.621 -0.609 -0.448
44.4 In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative methods) of existing clusters.
44.5 Agglomerative clustering methods initialize each observation to be a tiny cluster of its own. Then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster. In this way, the number of clusters in the data set is reduced by one at each step. Eventually, all records are combined into a single huge cluster. Most computer programs that apply hierarchical clustering use agglomerative methods.
44.6 Divisive clustering methods begin with all the records in one big cluster, with the most dissimilar records being split off recursively, into a separate cluster, until each record represents its own cluster.
44.7 Single linkage, the nearest-neighbor approach, is based on the minimum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most similar members from each cluster. It tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.
44.8 Complete linkage, the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. It tends to form more compact, spherelike clusters.
44.9 Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records. The criterion is the average distance of all the records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately equal within-cluster variability. In general, average linkage leads to clusters more similar in shape to complete linkage than does single linkage.
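The linkages in 44.7 and 44.8 can be compared directly with `hclust()` on toy data; the data itself is made up for illustration:

```r
# #Single vs complete linkage on the same toy distance matrix
set.seed(3)
toy <- scale(matrix(rnorm(40), ncol = 2))
d   <- dist(toy)
hc_single   <- hclust(d, method = "single")    # merge height = MIN pairwise distance
hc_complete <- hclust(d, method = "complete")  # merge height = MAX pairwise distance
# #The final merge height under complete linkage can never be below single
# #linkage's, since it maximises over the same pairwise distances
stopifnot(max(hc_complete$height) >= max(hc_single$height))
table(cutree(hc_complete, k = 3))  # cluster sizes at k = 3
```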
xxB20Movies <- f_getRDS(xxB20Movies)
bb <- aa <- xxB20Movies
# #Drop ID | Scale |
xw <- aa %>% select(-1)
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
#
summary(xw)
## Horror Romcom Action Comedy Fantasy
## Min. : 0.00 Min. : 0.00 Min. : 24.60 Min. : 0.00 Min. : 0.00
## 1st Qu.: 40.00 1st Qu.:19.90 1st Qu.: 58.75 1st Qu.: 38.50 1st Qu.: 28.95
## Median : 62.80 Median :29.70 Median : 70.50 Median : 60.00 Median : 41.20
## Mean : 58.57 Mean :31.25 Mean : 68.84 Mean : 56.52 Mean : 45.61
## 3rd Qu.: 78.25 3rd Qu.:41.65 3rd Qu.: 80.55 3rd Qu.: 73.45 3rd Qu.: 59.85
## Max. :100.00 Max. :81.30 Max. :100.00 Max. :100.00 Max. :100.00
if(FALSE){# #WARNING: Installation may take some time.
install.packages("cluster", dependencies = TRUE)
install.packages("arules", dependencies = TRUE)
install.packages("arulesViz", dependencies = TRUE)
}
Agglomerative Clustering is also known as ‘Bottom-up’ and Divisive Clustering is also known as ‘Top-down’.
Upto 16:18 Mathematical Formulation of Linkages which are NOT included here.
Question: Nothing much can be inferred from Cluster 2, even though it has high number of data points (136). Does it call for further split
Question: When k=2 the average silhouette value improved, so should we accept this as optimum number of clusters
Question: Is it ok to mix and match k-means and hierarchical clustering
str(zw)
## tibble [291 x 5] (S3: tbl_df/tbl/data.frame)
## $ Horror : num [1:291] 0.572 0.97 0.469 1.663 1.043 ...
## $ Romcom : num [1:291] -0.0792 0.8276 0.751 -0.6032 -1.8397 ...
## $ Action : num [1:291] -0.0152 0.4812 -0.2351 0.5692 -0.0466 ...
## $ Comedy : num [1:291] -0.699 -1.728 -0.125 -1.374 -0.297 ...
## $ Fantasy: num [1:291] 0.559 1.004 -0.355 -0.241 -0.255 ...
#
# #Create distance matrix
dist_zw <- dist(zw)
#
hclust_com_zw <- hclust(dist_zw, method = "complete")
hclust_avg_zw <- hclust(dist_zw, method = "average")
hclust_sng_zw <- hclust(dist_zw, method = "single")
#
# #Cut Tree by Cluster membership
k2_com_zw <- cutree(hclust_com_zw, 2)
k3_com_zw <- cutree(hclust_com_zw, 3)
k4_com_zw <- cutree(hclust_com_zw, 4)
#
table(k3_com_zw)
## k3_com_zw
## 1 2 3
## 97 136 58
str(k3_com_zw)
## int [1:291] 1 1 2 1 1 2 1 1 1 1 ...
# #Save cluster membership of each point back into the dataset
res_movies <- cbind(xw, list(k3 = k3_com_zw, k4 = k4_com_zw)) %>% as_tibble()
#
# #Cluster Mean
if(FALSE) aggregate(zw, by = list(k3_com_zw), FUN = function(x) round(mean(x), 3))
# #Equivalent
res_movies %>% select(-k4) %>% group_by(k3) %>% summarise(N = n(), across(everything(), mean))
## # A tibble: 3 x 7
## k3 N Horror Romcom Action Comedy Fantasy
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 97 80.2 23.8 77.0 31.2 45.7
## 2 2 136 58.0 26.7 72.4 66.1 32.4
## 3 3 58 23.7 54.5 47.0 76.5 76.5
Figure 14.1 Movie: Dendrogram (Complete Linkage) with k = 3 G, 4 B, 6 R
Figure 14.2 Movie: Silhouette with Distance for k={3, 4}
Figure 14.3 Movie: Silhouette with Distance for k={3, 2}
46.1 Affinity analysis is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form "If antecedent, then consequent", along with a measure of the support and confidence associated with the rule.
It is an unsupervised learning method.
Ex: People who purchased Milk also purchased Bread.
Question: Is it similar to Conjoint Analysis?
Problem: Dimensionality
Refer to Association Rules.
46.2 The support (s) for a particular association rule \(A \Rightarrow B\) is the proportion of transactions in the set of transactions D that contain both antecedent A and consequent B. Support is Symmetric. \(\text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}}\)
46.3 The confidence (c) of the association rule \(A \Rightarrow B\) is a measure of the accuracy of the rule, as determined by the percentage of transactions in the set of transactions D containing antecedent A that also contain consequent B. Confidence is Asymmetric \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}}\)
46.8 Lift is a measure that can quantify the usefulness of an association rule. Lift is Symmetric. \(\text{Lift} = \frac{\text{Rule Confidence}}{\text{Prior proportion of Consequent}}\)
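These three measures can be verified by hand on a toy set of five transactions (hypothetical data, not the makeup dataset imported below):

```r
# #Toy computation of Support, Confidence, Lift for the rule {A} => {B}
tx <- data.frame(A = c(TRUE, TRUE, TRUE, FALSE, FALSE),
                 B = c(TRUE, TRUE, FALSE, TRUE, FALSE))
n <- nrow(tx)
support    <- sum(tx$A & tx$B) / n           # P(A and B)  = 2/5
confidence <- sum(tx$A & tx$B) / sum(tx$A)   # P(B|A)      = 2/3
lift       <- confidence / (sum(tx$B) / n)   # P(B|A)/P(B) = (2/3)/(3/5)
c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 indicates that A and B occur together more often than expected under independence; apriori() reports the same three columns in its quality slot.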
Please import the "B22-Makeup.csv".
bb <- aa <- xxB22Makeup
#
xw <- aa %>% mutate(across(everything(), factor, levels = c("No", "Yes")))
#str(xw)
dim(xw)
## [1] 1000 14
summary(xw)
## Bag Blush Nail Polish Brushes Concealer Eyebrow Pencils Bronzer Lip liner Mascara
## No :946 No :637 No :720 No :851 No :558 No :958 No :721 No :766 No :643
## Yes: 54 Yes:363 Yes:280 Yes:149 Yes:442 Yes: 42 Yes:279 Yes:234 Yes:357
## Eye shadow Foundation Lip Gloss Lipstick Eyeliner
## No :619 No :464 No :510 No :678 No :543
## Yes:381 Yes:536 Yes:490 Yes:322 Yes:457
# #See ?`ASparameter-class` for apriori() parameter details.
# #Caution is advised on running inspect() without prior subsetting /filtering the rules.
# #Find association rules
#rules <- apriori(xw, maxlen = ncol(xw))
rules <- apriori(xw, parameter = list(maxlen = ncol(xw)))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1 14 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 100
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[28 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [26 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 done [0.04s].
## writing ... [68960 rule(s)] done [0.01s].
## creating S4 object ... done [0.03s].
#
# #More Information
names(attributes(rules))
## [1] "quality" "info" "lhs" "rhs" "class"
#
str(attributes(rules)$quality)
## 'data.frame': 68960 obs. of 5 variables:
## $ support : num 0.851 0.946 0.958 0.149 0.129 0.138 0.214 0.224 0.257 0.265 ...
## $ confidence: num 0.851 0.946 0.958 1 0.866 ...
## $ coverage : num 1 1 1 0.149 0.149 0.149 0.234 0.234 0.279 0.279 ...
## $ lift : num 1 1 1 3.571 0.915 ...
## $ count : int 851 946 958 149 129 138 214 224 257 265 ...
str(attributes(rules)$info)
## List of 5
## $ data : symbol xw
## $ ntransactions: int 1000
## $ support : num 0.1
## $ confidence : num 0.8
## $ call : chr "apriori(data = xw, parameter = list(maxlen = ncol(xw)))"
attributes(rules)$lhs
## itemMatrix in sparse format with
## 68960 rows (elements/transactions) and
## 28 columns (items)
attributes(rules)$rhs
## itemMatrix in sparse format with
## 68960 rows (elements/transactions) and
## 28 columns (items)
#
summary(rules)
## set of 68960 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 3 85 942 4350 10739 17062 18066 11996 4665 972 80
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 6.547 8.000 11.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.1000 Min. :0.8000 Min. :0.1000 Min. :0.8781 Min. :100.0
## 1st Qu.:0.1150 1st Qu.:0.8669 1st Qu.:0.1250 1st Qu.:1.0389 1st Qu.:115.0
## Median :0.1370 Median :0.9453 Median :0.1490 Median :1.1568 Median :137.0
## Mean :0.1582 Mean :0.9260 Mean :0.1717 Mean :1.2021 Mean :158.2
## 3rd Qu.:0.1770 3rd Qu.:0.9821 3rd Qu.:0.1930 3rd Qu.:1.2440 3rd Qu.:177.0
## Max. :0.9580 Max. :1.0000 Max. :1.0000 Max. :3.5714 Max. :958.0
##
## mining info:
## data ntransactions support confidence call
## xw 1000 0.1 0.8 apriori(data = xw, parameter = list(maxlen = ncol(xw)))
#
names(attributes(summary(rules)))
## [1] "lengths" "lengthSummary" "length" "quality" "info" "class"
attributes(summary(rules))$length #Check Number of Rules Here.
## [1] 68960
attributes(summary(rules))$lengths
## sizes
## 1 2 3 4 5 6 7 8 9 10 11
## 3 85 942 4350 10739 17062 18066 11996 4665 972 80
#
# #inspect() Do not execute without knowing how many rows will be printed.
#inspect(rules[1:6])
#inspect(head(rules, 6))
inspect(head(rules, min(5, attributes(summary(rules))$length)))
## lhs rhs support confidence coverage lift count
## [1] {} => {Brushes=No} 0.851 0.8510000 1.000 1.0000000 851
## [2] {} => {Bag=No} 0.946 0.9460000 1.000 1.0000000 946
## [3] {} => {Eyebrow Pencils=No} 0.958 0.9580000 1.000 1.0000000 958
## [4] {Brushes=Yes} => {Nail Polish=Yes} 0.149 1.0000000 0.149 3.5714286 149
## [5] {Brushes=Yes} => {Bag=No} 0.129 0.8657718 0.149 0.9151922 129
# #Rules with more control and oversight.
rr_sup <- 0.7
rr_conf <- 0.8
rules <- apriori(xw, parameter = list(
  minlen = 2, maxlen = ncol(xw), support = rr_sup, confidence = rr_conf))

| SN | LHS_Antecedent | x | RHS_Consequent | Support | Confidence | Coverage | Lift | Count |
|---|---|---|---|---|---|---|---|---|
| 1 | {Nail Polish=No} | => | {Brushes=No} | 0.72 | 1 | 0.72 | 1.18 | 720 |
| 2 | {Brushes=No} | => | {Nail Polish=No} | 0.72 | 0.846 | 0.851 | 1.18 | 720 |
| 3 | {Lip liner=No} | => | {Bag=No} | 0.732 | 0.956 | 0.766 | 1.01 | 732 |
| 4 | {Lip liner=No} | => | {Eyebrow Pencils=No} | 0.734 | 0.958 | 0.766 | 1 | 734 |
| 5 | {Brushes=No} | => | {Bag=No} | 0.817 | 0.96 | 0.851 | 1.01 | 817 |
| 6 | {Bag=No} | => | {Brushes=No} | 0.817 | 0.864 | 0.946 | 1.01 | 817 |
| 7 | {Brushes=No} | => | {Eyebrow Pencils=No} | 0.82 | 0.964 | 0.851 | 1.01 | 820 |
| 8 | {Eyebrow Pencils=No} | => | {Brushes=No} | 0.82 | 0.856 | 0.958 | 1.01 | 820 |
| 9 | {Bag=No} | => | {Eyebrow Pencils=No} | 0.909 | 0.961 | 0.946 | 1 | 909 |
| 10 | {Eyebrow Pencils=No} | => | {Bag=No} | 0.909 | 0.949 | 0.958 | 1 | 909 |
| 11 | {Bag=No, Lip liner=No} | => | {Eyebrow Pencils=No} | 0.703 | 0.96 | 0.732 | 1 | 703 |
| 12 | {Eyebrow Pencils=No, Lip liner=No} | => | {Bag=No} | 0.703 | 0.958 | 0.734 | 1.01 | 703 |
| 13 | {Bag=No, Brushes=No} | => | {Eyebrow Pencils=No} | 0.789 | 0.966 | 0.817 | 1.01 | 789 |
| 14 | {Brushes=No, Eyebrow Pencils=No} | => | {Bag=No} | 0.789 | 0.962 | 0.82 | 1.02 | 789 |
| 15 | {Bag=No, Eyebrow Pencils=No} | => | {Brushes=No} | 0.789 | 0.868 | 0.909 | 1.02 | 789 |
# #Do not print more than 50 Rules at a time.
inspect(head(rules, min(50, attributes(summary(rules))$length)))
## lhs rhs support confidence coverage
## [1] {Nail Polish=No} => {Brushes=No} 0.720 1.0000000 0.720
## [2] {Brushes=No} => {Nail Polish=No} 0.720 0.8460635 0.851
## [3] {Lip liner=No} => {Bag=No} 0.732 0.9556136 0.766
## [4] {Lip liner=No} => {Eyebrow Pencils=No} 0.734 0.9582245 0.766
## [5] {Brushes=No} => {Bag=No} 0.817 0.9600470 0.851
## [6] {Bag=No} => {Brushes=No} 0.817 0.8636364 0.946
## [7] {Brushes=No} => {Eyebrow Pencils=No} 0.820 0.9635723 0.851
## [8] {Eyebrow Pencils=No} => {Brushes=No} 0.820 0.8559499 0.958
## [9] {Bag=No} => {Eyebrow Pencils=No} 0.909 0.9608879 0.946
## [10] {Eyebrow Pencils=No} => {Bag=No} 0.909 0.9488518 0.958
## [11] {Bag=No, Lip liner=No} => {Eyebrow Pencils=No} 0.703 0.9603825 0.732
## [12] {Eyebrow Pencils=No, Lip liner=No} => {Bag=No} 0.703 0.9577657 0.734
## [13] {Bag=No, Brushes=No} => {Eyebrow Pencils=No} 0.789 0.9657283 0.817
## [14] {Brushes=No, Eyebrow Pencils=No} => {Bag=No} 0.789 0.9621951 0.820
## [15] {Bag=No, Eyebrow Pencils=No} => {Brushes=No} 0.789 0.8679868 0.909
## lift count
## [1] 1.175088 720
## [2] 1.175088 720
## [3] 1.010162 732
## [4] 1.000234 734
## [5] 1.014849 817
## [6] 1.014849 817
## [7] 1.005817 820
## [8] 1.005817 820
## [9] 1.003015 909
## [10] 1.003015 909
## [11] 1.002487 703
## [12] 1.012437 703
## [13] 1.008067 789
## [14] 1.017120 789
## [15] 1.019961 789
# #Limit Max Rows | To Tibble | Rename | Add Row Numbers | Relocate | Format decimals |
inspect(head(rules, min(50, attributes(summary(rules))$length))) %>%
as_tibble(.name_repair = 'unique') %>%
rename(x = '...2', LHS_Antecedent = lhs, RHS_Consequent = rhs) %>%
rename_with(str_to_title, .cols = where(is.numeric)) %>%
mutate(SN = row_number()) %>% relocate(SN) %>%
mutate(across(where(is.numeric), format, digits = 3, drop0trailing = TRUE, scientific = FALSE))
## lhs rhs support confidence coverage
## [1] {Nail Polish=No} => {Brushes=No} 0.720 1.0000000 0.720
## [2] {Brushes=No} => {Nail Polish=No} 0.720 0.8460635 0.851
## [3] {Lip liner=No} => {Bag=No} 0.732 0.9556136 0.766
## [4] {Lip liner=No} => {Eyebrow Pencils=No} 0.734 0.9582245 0.766
## [5] {Brushes=No} => {Bag=No} 0.817 0.9600470 0.851
## [6] {Bag=No} => {Brushes=No} 0.817 0.8636364 0.946
## [7] {Brushes=No} => {Eyebrow Pencils=No} 0.820 0.9635723 0.851
## [8] {Eyebrow Pencils=No} => {Brushes=No} 0.820 0.8559499 0.958
## [9] {Bag=No} => {Eyebrow Pencils=No} 0.909 0.9608879 0.946
## [10] {Eyebrow Pencils=No} => {Bag=No} 0.909 0.9488518 0.958
## [11] {Bag=No, Lip liner=No} => {Eyebrow Pencils=No} 0.703 0.9603825 0.732
## [12] {Eyebrow Pencils=No, Lip liner=No} => {Bag=No} 0.703 0.9577657 0.734
## [13] {Bag=No, Brushes=No} => {Eyebrow Pencils=No} 0.789 0.9657283 0.817
## [14] {Brushes=No, Eyebrow Pencils=No} => {Bag=No} 0.789 0.9621951 0.820
## [15] {Bag=No, Eyebrow Pencils=No} => {Brushes=No} 0.789 0.8679868 0.909
## lift count
## [1] 1.175088 720
## [2] 1.175088 720
## [3] 1.010162 732
## [4] 1.000234 734
## [5] 1.014849 817
## [6] 1.014849 817
## [7] 1.005817 820
## [8] 1.005817 820
## [9] 1.003015 909
## [10] 1.003015 909
## [11] 1.002487 703
## [12] 1.012437 703
## [13] 1.008067 789
## [14] 1.017120 789
## [15] 1.019961 789
## # A tibble: 15 x 9
## SN LHS_Antecedent x RHS_Consequent Support Confidence Coverage Lift Count
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 " 1" {Nail Polish=No} => {Brushes=No} 0.72 1 0.72 1.18 720
## 2 " 2" {Brushes=No} => {Nail Polish=No} 0.72 0.846 0.851 1.18 720
## 3 " 3" {Lip liner=No} => {Bag=No} 0.732 0.956 0.766 1.01 732
## 4 " 4" {Lip liner=No} => {Eyebrow Pencils~ 0.734 0.958 0.766 1 734
## 5 " 5" {Brushes=No} => {Bag=No} 0.817 0.96 0.851 1.01 817
## 6 " 6" {Bag=No} => {Brushes=No} 0.817 0.864 0.946 1.01 817
## 7 " 7" {Brushes=No} => {Eyebrow Pencils~ 0.82 0.964 0.851 1.01 820
## 8 " 8" {Eyebrow Pencils=No} => {Brushes=No} 0.82 0.856 0.958 1.01 820
## 9 " 9" {Bag=No} => {Eyebrow Pencils~ 0.909 0.961 0.946 1 909
## 10 "10" {Eyebrow Pencils=No} => {Bag=No} 0.909 0.949 0.958 1 909
## 11 "11" {Bag=No, Lip liner=No} => {Eyebrow Pencils~ 0.703 0.96 0.732 1 703
## 12 "12" {Eyebrow Pencils=No, Lip l~ => {Bag=No} 0.703 0.958 0.734 1.01 703
## 13 "13" {Bag=No, Brushes=No} => {Eyebrow Pencils~ 0.789 0.966 0.817 1.01 789
## 14 "14" {Brushes=No, Eyebrow Penci~ => {Bag=No} 0.789 0.962 0.82 1.02 789
## 15 "15" {Bag=No, Eyebrow Pencils=N~ => {Brushes=No} 0.789 0.868 0.909 1.02 789
# #If data has TRUE /FALSE in place of Yes /No, then more options are available: sum(), which()
# #The last option 'mm' does not use table() and remains a tibble, so fewer steps are required
summary(xw)
## Bag Blush Nail Polish Brushes Concealer Eyebrow Pencils Bronzer Lip liner Mascara
## No :946 No :637 No :720 No :851 No :558 No :958 No :721 No :766 No :643
## Yes: 54 Yes:363 Yes:280 Yes:149 Yes:442 Yes: 42 Yes:279 Yes:234 Yes:357
## Eye shadow Foundation Lip Gloss Lipstick Eyeliner
## No :619 No :464 No :510 No :678 No :543
## Yes:381 Yes:536 Yes:490 Yes:322 Yes:457
# #Count Binary Columns | Transpose | Tibble | Integer | Sort |
ii <- t(vapply(xw, table, numeric(2))) %>%
as_tibble(rownames = 'Items') %>%
mutate(across(where(is.numeric), as.integer)) %>%
arrange(desc(Yes))
# #Match One of the Values | Transpose | Tibble | Rename | Wide | Rename | Sort |
jj <- t(table(xw == 'Yes', names(xw)[col(xw)])) %>%
as_tibble(.name_repair = 'unique') %>%
rename(Items = 1, Key = 2) %>%
pivot_wider(names_from = Key, values_from = n) %>%
rename(No = 2, Yes = 3) %>%
arrange(desc(Yes))
# #Unlist | Remove Appended Numbers | Count | Transpose | Tibble | Rename | Wide | Rename | Sort |
kk <- t(table(unlist(xw), sub('\\d+', '', names(unlist(xw))))) %>%
as_tibble(.name_repair = 'unique') %>%
rename(Items = 1, Key = 2) %>%
pivot_wider(names_from = Key, values_from = n) %>%
rename(No = 2, Yes = 3) %>%
arrange(desc(Yes))
# #Long | Table | Tibble | Wide | Rename | Sort |
ll <- xw %>%
pivot_longer(cols = everything(), names_to = 'Items', values_to = 'Key') %>%
table() %>%
as_tibble() %>%
pivot_wider(names_from = Key, values_from = n) %>%
rename(No = 2, Yes = 3) %>%
arrange(desc(Yes))
# #Long | Count | Wide | Sort |
mm <- xw %>%
pivot_longer(cols = everything(), names_to = 'Items', values_to = 'Key') %>%
count(Items, Key) %>%
pivot_wider(names_from = Key, values_from = n) %>%
arrange(desc(Yes))
stopifnot(all(vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))))
#
# #Option 'mm' is preferable
xw %>%
pivot_longer(cols = everything()) %>%
count(name, value) %>%
pivot_wider(names_from = value, values_from = n) %>%
arrange(desc(Yes))
## # A tibble: 14 x 3
## name No Yes
## <chr> <int> <int>
## 1 Foundation 464 536
## 2 Lip Gloss 510 490
## 3 Eyeliner 543 457
## 4 Concealer 558 442
## 5 Eye shadow 619 381
## 6 Blush 637 363
## 7 Mascara 643 357
## 8 Lipstick 678 322
## 9 Nail Polish 720 280
## 10 Bronzer 721 279
## 11 Lip liner 766 234
## 12 Brushes 851 149
## 13 Bag 946 54
## 14 Eyebrow Pencils 958 42
mm <- ll <- kk <- jj <- ii <- 1:5
# #Pairwise Identical Check
all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm))
## [1] TRUE
#
# #Pairwise Identical Check using vapply()
# #It can provide info on which pair does not match OR can be passed to all()
vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))
## [1] TRUE TRUE TRUE TRUE
#
stopifnot(all(vapply(list(jj, kk, ll, mm), FUN = function(x) identical(x, ii), logical(1))))
# #Rules with more control and oversight. RHS: "Foundation=Yes"
rr_sup <- 0.1
rr_conf <- 0.7
rules <- suppressWarnings(apriori(xw,
parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
appearance = list(rhs = paste0(names(xw)[11], "=", levels(xw[[11]])[2]),
  default = "lhs")))

| SN | LHS_Antecedent | x | RHS_Consequent | Support | Confidence | Coverage | Lift | Count |
|---|---|---|---|---|---|---|---|---|
| 1 | {Lip Gloss=Yes} | => | {Foundation=Yes} | 0.356 | 0.727 | 0.49 | 1.36 | 356 |
| 2 | {Lip Gloss=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.116 | 0.734 | 0.158 | 1.37 | 116 |
| 3 | {Mascara=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.13 | 0.718 | 0.181 | 1.34 | 130 |
| 4 | {Eye shadow=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.146 | 0.726 | 0.201 | 1.36 | 146 |
| 5 | {Lip Gloss=Yes, Eyeliner=No} | => | {Foundation=Yes} | 0.2 | 0.76 | 0.263 | 1.42 | 200 |
| 6 | {Concealer=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.215 | 0.79 | 0.272 | 1.47 | 215 |
| 7 | {Eye shadow=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.21 | 0.727 | 0.289 | 1.36 | 210 |
| 8 | {Blush=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.237 | 0.76 | 0.312 | 1.42 | 237 |
| 9 | {Mascara=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.226 | 0.731 | 0.309 | 1.36 | 226 |
| 10 | {Lip Gloss=Yes, Lipstick=No} | => | {Foundation=Yes} | 0.24 | 0.723 | 0.332 | 1.35 | 240 |
| 11 | {Nail Polish=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.267 | 0.75 | 0.356 | 1.4 | 267 |
| 12 | {Bronzer=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.295 | 0.845 | 0.349 | 1.58 | 295 |
| 13 | {Lip liner=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.31 | 0.829 | 0.374 | 1.55 | 310 |
| 14 | {Brushes=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.313 | 0.742 | 0.422 | 1.38 | 313 |
| 15 | {Bag=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.335 | 0.728 | 0.46 | 1.36 | 335 |
| 16 | {Eyebrow Pencils=No, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.345 | 0.728 | 0.474 | 1.36 | 345 |
# #Specific Rule: SN = 1: LHS_Antecedent {Lip Gloss=Yes} RHS_Consequent {Foundation=Yes}
ii <- xw %>% select(11, 12) %>% rename(Lip_Gloss = 2) %>% count(Foundation, Lip_Gloss)
# #490 'Lip Gloss' purchased in 1000 Total
ii %>% group_by(Lip_Gloss) %>% summarise(SUM = sum(n)) %>% mutate(PROP = SUM/sum(SUM))
## # A tibble: 2 x 3
## Lip_Gloss SUM PROP
## <fct> <int> <dbl>
## 1 No 510 0.51
## 2 Yes 490 0.49
# #536 'Foundation' purchased in 1000 Total
ii %>% group_by(Foundation) %>% summarise(SUM = sum(n)) %>% mutate(PROP = SUM/sum(SUM))
## # A tibble: 2 x 3
## Foundation SUM PROP
## <fct> <int> <dbl>
## 1 No 464 0.464
## 2 Yes 536 0.536
# #356 'Foundation' purchased within 490 'Lip Gloss' purchases
ii %>% filter(Lip_Gloss == 'Yes') %>% mutate(PROP = n/sum(n))
## # A tibble: 2 x 4
## Foundation Lip_Gloss n PROP
## <fct> <fct> <int> <dbl>
## 1 No Yes 134 0.273
## 2 Yes Yes 356 0.727# #RHS: "Foundation=Yes"
# #LHS: Only Bag=Yes, Blush=Yes | To identify these rules, supply lower support and confidence
# #default = "both", "lhs", "rhs", "none". Specifies the default appearance for all items ...
# #...not explicitly mentioned in the other elements of the list.
# #If default = "lhs" is supplied, then redundant rules come up.
rr_sup <- 0.01
rr_conf <- 0.1
rules <- suppressWarnings(apriori(xw,
parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
appearance = list(rhs = paste0(names(xw)[11], "=", levels(xw[[11]])[2]),
lhs = c("Bag=Yes", "Blush=Yes"),
  default = "none")))

| SN | LHS_Antecedent | x | RHS_Consequent | Support | Confidence | Coverage | Lift | Count |
|---|---|---|---|---|---|---|---|---|
| 1 | {Bag=Yes} | => | {Foundation=Yes} | 0.031 | 0.574 | 0.054 | 1.071 | 31 |
| 2 | {Blush=Yes} | => | {Foundation=Yes} | 0.192 | 0.529 | 0.363 | 0.987 | 192 |
| 3 | {Bag=Yes, Blush=Yes} | => | {Foundation=Yes} | 0.019 | 0.594 | 0.032 | 1.108 | 19 |
# #RHS: "Foundation=Yes"
# #LHS: All Yes Only
rr_sup <- 0.01
rr_conf <- 0.1
rr_rhs <- 11L #index of "Foundation"
rules <- suppressWarnings(apriori(xw,
parameter = list(minlen = 2, maxlen = 3, support = rr_sup, confidence = rr_conf),
appearance = list(rhs = paste0(names(xw)[rr_rhs], "=", levels(xw[[rr_rhs]])[2]),
lhs = paste0(names(xw)[-rr_rhs], "=", levels(xw[[rr_rhs]])[2]),
  default = "none")))

| SN | LHS_Antecedent | x | RHS_Consequent | Support | Confidence | Coverage | Lift | Count |
|---|---|---|---|---|---|---|---|---|
| 1 | {Eyebrow Pencils=Yes} | => | {Foundation=Yes} | 0.019 | 0.452 | 0.042 | 0.844 | 19 |
| 2 | {Bag=Yes} | => | {Foundation=Yes} | 0.031 | 0.574 | 0.054 | 1.071 | 31 |
| 3 | {Brushes=Yes} | => | {Foundation=Yes} | 0.074 | 0.497 | 0.149 | 0.927 | 74 |
| 4 | {Lip liner=Yes} | => | {Foundation=Yes} | 0.087 | 0.372 | 0.234 | 0.694 | 87 |
| 5 | {Lipstick=Yes} | => | {Foundation=Yes} | 0.167 | 0.519 | 0.322 | 0.968 | 167 |
| 6 | {Nail Polish=Yes} | => | {Foundation=Yes} | 0.143 | 0.511 | 0.28 | 0.953 | 143 |
| 7 | {Bronzer=Yes} | => | {Foundation=Yes} | 0.133 | 0.477 | 0.279 | 0.889 | 133 |
| 8 | {Blush=Yes} | => | {Foundation=Yes} | 0.192 | 0.529 | 0.363 | 0.987 | 192 |
| 9 | {Mascara=Yes} | => | {Foundation=Yes} | 0.192 | 0.538 | 0.357 | 1.003 | 192 |
| 10 | {Eye shadow=Yes} | => | {Foundation=Yes} | 0.211 | 0.554 | 0.381 | 1.033 | 211 |
| 11 | {Eyeliner=Yes} | => | {Foundation=Yes} | 0.238 | 0.521 | 0.457 | 0.972 | 238 |
| 12 | {Lip Gloss=Yes} | => | {Foundation=Yes} | 0.356 | 0.727 | 0.49 | 1.355 | 356 |
| 13 | {Concealer=Yes} | => | {Foundation=Yes} | 0.231 | 0.523 | 0.442 | 0.975 | 231 |
| 14 | {Eyebrow Pencils=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.013 | 0.481 | 0.027 | 0.898 | 13 |
| 15 | {Blush=Yes, Eyebrow Pencils=Yes} | => | {Foundation=Yes} | 0.014 | 0.5 | 0.028 | 0.933 | 14 |
| 16 | {Eyebrow Pencils=Yes, Mascara=Yes} | => | {Foundation=Yes} | 0.01 | 0.435 | 0.023 | 0.811 | 10 |
| 17 | {Eyebrow Pencils=Yes, Eye shadow=Yes} | => | {Foundation=Yes} | 0.012 | 0.48 | 0.025 | 0.896 | 12 |
| 18 | {Eyebrow Pencils=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.011 | 0.688 | 0.016 | 1.283 | 11 |
| 19 | {Concealer=Yes, Eyebrow Pencils=Yes} | => | {Foundation=Yes} | 0.011 | 0.5 | 0.022 | 0.933 | 11 |
| 20 | {Bag=Yes, Brushes=Yes} | => | {Foundation=Yes} | 0.01 | 0.5 | 0.02 | 0.933 | 10 |
| 21 | {Bag=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.011 | 0.688 | 0.016 | 1.283 | 11 |
| 22 | {Bag=Yes, Nail Polish=Yes} | => | {Foundation=Yes} | 0.015 | 0.536 | 0.028 | 0.999 | 15 |
| 23 | {Bag=Yes, Bronzer=Yes} | => | {Foundation=Yes} | 0.011 | 0.5 | 0.022 | 0.933 | 11 |
| 24 | {Bag=Yes, Blush=Yes} | => | {Foundation=Yes} | 0.019 | 0.594 | 0.032 | 1.108 | 19 |
| 25 | {Bag=Yes, Mascara=Yes} | => | {Foundation=Yes} | 0.022 | 0.579 | 0.038 | 1.08 | 22 |
| 26 | {Bag=Yes, Eye shadow=Yes} | => | {Foundation=Yes} | 0.016 | 0.5 | 0.032 | 0.933 | 16 |
| 27 | {Bag=Yes, Eyeliner=Yes} | => | {Foundation=Yes} | 0.015 | 0.517 | 0.029 | 0.965 | 15 |
| 28 | {Bag=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.021 | 0.7 | 0.03 | 1.306 | 21 |
| 29 | {Bag=Yes, Concealer=Yes} | => | {Foundation=Yes} | 0.021 | 0.6 | 0.035 | 1.119 | 21 |
| 30 | {Brushes=Yes, Lip liner=Yes} | => | {Foundation=Yes} | 0.023 | 0.338 | 0.068 | 0.631 | 23 |
| 31 | {Brushes=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.028 | 0.571 | 0.049 | 1.066 | 28 |
| 32 | {Nail Polish=Yes, Brushes=Yes} | => | {Foundation=Yes} | 0.074 | 0.497 | 0.149 | 0.927 | 74 |
| 33 | {Brushes=Yes, Bronzer=Yes} | => | {Foundation=Yes} | 0.046 | 0.474 | 0.097 | 0.885 | 46 |
| 34 | {Blush=Yes, Brushes=Yes} | => | {Foundation=Yes} | 0.039 | 0.557 | 0.07 | 1.039 | 39 |
| 35 | {Brushes=Yes, Mascara=Yes} | => | {Foundation=Yes} | 0.039 | 0.47 | 0.083 | 0.877 | 39 |
| 36 | {Brushes=Yes, Eye shadow=Yes} | => | {Foundation=Yes} | 0.037 | 0.457 | 0.081 | 0.852 | 37 |
| 37 | {Brushes=Yes, Eyeliner=Yes} | => | {Foundation=Yes} | 0.034 | 0.436 | 0.078 | 0.813 | 34 |
| 38 | {Brushes=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.043 | 0.632 | 0.068 | 1.18 | 43 |
| 39 | {Brushes=Yes, Concealer=Yes} | => | {Foundation=Yes} | 0.044 | 0.478 | 0.092 | 0.892 | 44 |
| 40 | {Lip liner=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.025 | 0.333 | 0.075 | 0.622 | 25 |
| 41 | {Nail Polish=Yes, Lip liner=Yes} | => | {Foundation=Yes} | 0.036 | 0.387 | 0.093 | 0.722 | 36 |
| 42 | {Bronzer=Yes, Lip liner=Yes} | => | {Foundation=Yes} | 0.046 | 0.359 | 0.128 | 0.67 | 46 |
| 43 | {Blush=Yes, Lip liner=Yes} | => | {Foundation=Yes} | 0.051 | 0.411 | 0.124 | 0.767 | 51 |
| 44 | {Lip liner=Yes, Mascara=Yes} | => | {Foundation=Yes} | 0.042 | 0.393 | 0.107 | 0.732 | 42 |
| 45 | {Lip liner=Yes, Eye shadow=Yes} | => | {Foundation=Yes} | 0.042 | 0.375 | 0.112 | 0.7 | 42 |
| 46 | {Lip liner=Yes, Eyeliner=Yes} | => | {Foundation=Yes} | 0.046 | 0.354 | 0.13 | 0.66 | 46 |
| 47 | {Lip liner=Yes, Lip Gloss=Yes} | => | {Foundation=Yes} | 0.046 | 0.397 | 0.116 | 0.74 | 46 |
| 48 | {Concealer=Yes, Lip liner=Yes} | => | {Foundation=Yes} | 0.07 | 0.391 | 0.179 | 0.73 | 70 |
| 49 | {Nail Polish=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.051 | 0.573 | 0.089 | 1.069 | 51 |
| 50 | {Bronzer=Yes, Lipstick=Yes} | => | {Foundation=Yes} | 0.044 | 0.489 | 0.09 | 0.912 | 44 |
# #Paste String "=Yes" to each element of a Vector except "Foundation"
names(xw)
## [1] "Bag" "Blush" "Nail Polish" "Brushes" "Concealer"
## [6] "Eyebrow Pencils" "Bronzer" "Lip liner" "Mascara" "Eye shadow"
## [11] "Foundation" "Lip Gloss" "Lipstick" "Eyeliner"
paste0(names(xw)[-11], "=", levels(xw[[11]])[2])
## [1] "Bag=Yes" "Blush=Yes" "Nail Polish=Yes" "Brushes=Yes"
## [5] "Concealer=Yes" "Eyebrow Pencils=Yes" "Bronzer=Yes" "Lip liner=Yes"
## [9] "Mascara=Yes" "Eye shadow=Yes" "Lip Gloss=Yes" "Lipstick=Yes"
## [13] "Eyeliner=Yes"
Please import the "B23-Champo.csv".
xxB23Champo <- c("xxB23Champo_2_Both", "xxB23Champo_3_Order", "xxB23Champo_4_Sample",
"xxB23Champo_6_Cluster", "xxB23Champo_5_Reco", "xxB23Champo_7_Colours",
                 "xxB23Champo_8_SKU", "xxB23Champo_9_RecoTrans")
# #Dimensions of all of these datasets
str(lapply(xxB23Champo, function(x) {dim(eval(parse(text = x)))}))
## List of 8
## $ : int [1:2] 18955 16
## $ : int [1:2] 13135 12
## $ : int [1:2] 5820 25
## $ : int [1:2] 20 21
## $ : int [1:2] 45 14
## $ : int [1:2] 11 8
## $ : int [1:2] 11 8
## $ : int [1:2] 20 21
# #Path to the Excel File #read_delim(clipboard())
loc_src <- paste0(.z$XL, "B23-Champo.xlsx")
#excel_sheets(loc_src)
# #Read Sheets
xxB23Champo_2_Both <- read_excel(path = loc_src, sheet = 2)
xxB23Champo_3_Order <- read_excel(path = loc_src, sheet = 3)
xxB23Champo_4_Sample <- read_excel(path = loc_src, sheet = 4)
xxB23Champo_6_Cluster <- read_excel(path = loc_src, sheet = 6)
xxB23Champo_5_Reco <- read_excel(path = loc_src, sheet = 5, range = "A1:U21")
xxB23Champo_7_Colours <- read_excel(path = loc_src, sheet = 7, range = "A1:H12")
xxB23Champo_8_SKU <- read_excel(path = loc_src, sheet = 7, range = "J1:Q12")
xxB23Champo_9_RecoTrans <- read_excel(path = loc_src, sheet = 5, range = "X1:AR21")
# #Save the Loaded data as Binary Files
for(ii in xxB23Champo){
saveRDS(eval(parse(text = ii)), paste0(.z$XL, ii, ".rds"))
}
Process (3 weeks to 3 months): Design \(\Rightarrow\) CAD (Visual, Material) \(\Rightarrow\) Procurement \(\Rightarrow\) Warehousing \(\Rightarrow\) Dyeing \(\Rightarrow\) Storage of Dyed Yarn \(\Rightarrow\) Preparation for Weaving or Hand-Tufting \(\Rightarrow\) Winding \(\Rightarrow\) Finishing (edges etc.) \(\Rightarrow\) Inspection \(\Rightarrow\) Dispatch.
Product categories (4 major): hand-tufted carpets (least effort, most popular), hand-knotted carpets (skilled, most expensive), Kilims (woolen, expensive), and Durries (Indian variant).
The company sent samples to the client as per …
Goal: a cost-efficient way of selecting appropriate sample designs that could generate maximum revenue.
Belief: carpet attributes could be used for creating customer segments, which in turn could be used for developing models, such as classification, to identify customer preferences and recommendation systems.
Objective: to identify the most important customers and the most important products, and to find a way to connect the two using suitable attributes from the data and appropriate analytics models.
Please import the "B23-FantasySports.csv".
xxB23Sports <- c("xxB23Sports_Q3_2T_Paid", "xxB23Sports_Q3_2T_Free", "xxB23Sports_Q4_2T",
"xxB23Sports_Q5_Chi_Player", "xxB23Sports_Q5_Chi_Captain", "xxB23Sports_Q6_2T_119_Select",
"xxB23Sports_Q6_2T_119_NotSelect", "xxB23Sports_Q6_2T_6_Select", "xxB23Sports_Q6_2T_6_NotSelect",
"xxB23Sports_Q7_Anova_NotSelect", "xxB23Sports_Q7_Anova_Captain", "xxB23Sports_Q7_Anova_VC",
    "xxB23Sports_Q8_Regression")
# #Dimensions of all of these datasets.
#sapply(lapply(xxChampo, function(x) {dim(eval(parse(text = x)))}), "[[", 1)
str(lapply(xxB23Sports, function(x) {dim(eval(parse(text = x)))}))
## List of 13
## $ : int [1:2] 5180 3
## $ : int [1:2] 8288 3
## $ : int [1:2] 72 15
## $ : int [1:2] 10 16
## $ : int [1:2] 10 16
## $ : int [1:2] 223738 2
## $ : int [1:2] 22087 2
## $ : int [1:2] 279868 2
## $ : int [1:2] 159890 2
## $ : int [1:2] 178691 3
## $ : int [1:2] 225474 3
## $ : int [1:2] 85710 3
## $ : int [1:2] 55272 5
# #Path to the Excel File
loc_src <- paste0(.z$XL, "B23-FantasySports.xlsx")
# #Read Sheets
xxB23Sports_Q3_2T_Paid <- read_excel(path = loc_src, sheet = 2, range = "A8:C5188")
xxB23Sports_Q3_2T_Free <- read_excel(path = loc_src, sheet = 2, range = "E8:G8296")
xxB23Sports_Q4_2T <- read_excel(path = loc_src, sheet = 3, range = "A8:O80")
xxB23Sports_Q5_Chi_Player <- read_excel(path = loc_src, sheet = 4, range = "A15:P25")
xxB23Sports_Q5_Chi_Captain <- read_excel(path = loc_src, sheet = 4, range = "A31:P41")
xxB23Sports_Q8_Regression <- read_excel(path = loc_src, sheet = 7, range = "A7:E55279")
#
# #Create CSV Files because of package failure in reading large excel data
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-119-Select", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_119_Select <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-119-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_119_NotSelect <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-6-Select", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_6_Select <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q6-2T-6-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q6_2T_6_NotSelect <- tbl
#
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-NotSelect", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_NotSelect <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-Captain", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_Captain <- tbl
tbl <- read_csv(paste0(.z$XL, "B23-Sports-Q7-Anova-VC", ".csv"), show_col_types = FALSE)
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxB23Sports_Q7_Anova_VC <- tbl
# #Save the Loaded data as Binary Files
for(ii in xxB23Sports){
saveRDS(eval(parse(text = ii)), paste0(.z$XL, ii, ".rds"))
}whether fantasy sports was a game of chance or skill, especially whether skill is a dominant factor in winning fantasy sports competition.
(Legal) Whether the game is one of skill or chance was to be decided based on whether the skill-based element is dominant over chance in determining the outcome of the game.
If fantasy sports is chance-based, then every user should have an equal probability of winning.
We need to understand the key difference between skill and chance and how to test it using the data.
To prove that the game is skill-dominant, we have to show that users who score high in fantasy sports are playing strategically: their selection of players, captain, and vice-captain is knowledge-based rather than random.
If fantasy games involve skill, then we can expect consistency in the performance of users, both low and high scorers. Alternatively, we can also check whether selection of a specific player increases the probability of winning.
Our approach should be to identify and test several possible hypotheses to establish whether fantasy sports is skill-dominant or chance-dominant.
A few possible hypotheses are listed below:
11 Players, 100 credits, Captain 2x, VC 1.5x, Max. players from a Team =7 (C1 … C7)
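The equal-probability idea lends itself to a chi-squared goodness-of-fit test on win counts. The sketch below uses simulated data only; the user count, contest count, and the 5x "skill" multiplier are illustrative assumptions, not values from the Dream 11 dataset.

```r
# Hypothetical illustration (simulated data, NOT the Dream 11 dataset):
# if winning were pure chance, win counts across users should be roughly
# uniform, and a chi-squared goodness-of-fit test should not reject uniformity.
set.seed(3)
n_users    <- 50
n_contests <- 2000
# Chance-dominant world: each contest winner drawn uniformly at random
wins_chance <- table(factor(sample(seq_len(n_users), n_contests, replace = TRUE),
                            levels = seq_len(n_users)))
test_chance <- chisq.test(wins_chance)   # expect a large p-value
# Skill-dominant world: five "skilled" users are 5x as likely to win
probs <- rep(1, n_users); probs[1:5] <- 5; probs <- probs / sum(probs)
wins_skill <- table(factor(sample(seq_len(n_users), n_contests, replace = TRUE,
                                  prob = probs),
                           levels = seq_len(n_users)))
test_skill <- chisq.test(wins_skill)     # expect a tiny p-value
c(chance = test_chance$p.value, skill = test_skill$p.value)
```

On the real data, the same test could be run on observed win counts per user to check whether the uniform-win hypothesis survives.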
Please import the "B23-RFM.csv".
As a data scientist, you would like to analyse recency, frequency, and monetary value of an online store. Based on these, you would like to suggest suitable market segments so that the online store can implement marketing actions efficiently and effectively. In this attempt, use the data (see “B23-RFM.csv”), perform the RFM analysis, and provide practical/managerial recommendations.
bb <- aa <- xxB23RFM
# #Convert to Date
bb$ORDERDATE <- dmy(bb$ORDERDATE)
# #Get Analysis Date as the Next Date after the Max Date in the Data
analysis_date <- max(bb$ORDERDATE) + 1 #as_date("2005-12-02")
#
# #RFM analysis by rfm_table_order()
rfm_result <- rfm_table_order(bb, customer_id = CUSTOMERNAME, order_date = ORDERDATE,
revenue = SALES, analysis_date = analysis_date)
# #Output is a Tibble with some other attributes
loc_src <- paste0(.z$XL, "B23-Results-RFM.csv")
# #Save the Result in a CSV
if(FALSE) write_csv(rfm_result$rfm, file = loc_src)
# #Developing segments
segment_titles <- c("First Grade", "Loyal", "Likely to be Loyal", "New Ones",
"Could be Promising", "Require Assistance", "Getting Less Frequent",
"Almost Out", "Can not Lose Them", "Do not Show Up at All")
# #Define Rules of Minimum and Maximum RFM for each group
r_low <- c(4, 2, 3, 4, 3, 2, 2, 1, 1, 1)
r_high <- c(5, 5, 5, 5, 4, 3, 3, 2, 1, 2)
f_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
f_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
m_low <- c(4, 3, 1, 1, 1, 2, 1, 2, 4, 1)
m_high <- c(5, 5, 3, 1, 1, 3, 2, 5, 5, 2)
#
stopifnot(all(vapply(list(r_low, r_high, f_low, f_high, m_low, m_high),
FUN = function(x) identical(length(x), length(segment_titles)), logical(1))))
divisions <- rfm_segment(rfm_result, segment_names = segment_titles,
recency_lower = r_low, recency_upper = r_high,
frequency_lower = f_low, frequency_upper = f_high,
monetary_lower = m_low, monetary_upper = m_high)
# #Output is a Tibble
# #Save the Result in a CSV
loc_src <- paste0(.z$XL, "B23-Results-Divisions.csv")
if(FALSE) write_csv(divisions, file = loc_src)
#
# #We defined 10 segments; however, only 7 (+1) of them are represented in the data,
# #and 48 customers were not captured by our classifications. These were assigned to 'Others'
divisions %>%
count(segment) %>%
mutate(PCT = round(100 * n / sum(n), 1)) %>%
rename(SEGMENT = segment, FREQ = n) %>%
arrange(desc(FREQ))
## # A tibble: 8 x 3
## SEGMENT FREQ PCT
## <chr> <int> <dbl>
## 1 First Grade 22 23.9
## 2 Likely to be Loyal 21 22.8
## 3 Loyal 20 21.7
## 4 Almost Out 10 10.9
## 5 Do not Show Up at All 7 7.6
## 6 Getting Less Frequent 7 7.6
## 7 Require Assistance 4 4.3
## 8 Others 1 1.1
if(FALSE) {#Histograms of Median RFM for each Segment
hh <- divisions
rfm_plot_median_recency(hh)
rfm_plot_median_frequency(hh)
rfm_plot_median_monetary(hh)
}
if(FALSE) {
hh <- rfm_result
rfm_histograms(hh) #Histograms of RFM
rfm_order_dist(hh) #Histograms of Customer Orders i.e. Frequency
rfm_heatmap(hh) #Heatmap of Monetary on Axes of Recency and Frequency. Slightly Useful
rfm_bar_chart(hh) #Bar Charts with Facetting of RFM
# #Scatter Plots among Recency, Monetary, Frequency
rfm_rm_plot(hh)
rfm_fm_plot(hh)
rfm_rf_plot(hh)
}
colSums(is.na(aa)) %>% as_tibble(rownames = "Cols") %>% filter(value > 0)
## # A tibble: 4 x 2
## Cols value
## <chr> <dbl>
## 1 ADDRESSLINE2 2521
## 2 STATE 1486
## 3 POSTALCODE 76
## 4 TERRITORY 1074
The Dream 11 platform has both free and paid users, that is, users who play games for free with no return and users who pay a fee and obtain returns at the end of the game based on their relative performance. Can the average scores of paid and free users help Dream 11 in testing whether the game is skill-based? (See Sheet “Qns_3_2SampleTTest” of “B23-FantasySports.xlsx”)
30.2 \(\text{\{Left or Lower\}}\quad {H_0}: {\mu}_1 - {\mu}_2 \geq {D_0} \quad \text{vs} \quad {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
# #Data
free <- xxB23Sports_Q3_2T_Free$userpoints
paid <- xxB23Sports_Q3_2T_Paid$userpoints
#
# #Sample Information
round(vapply(f_namedList(free, paid),
FUN = function(x) {c(N = length(x), Mean = mean(x), SD = sd(x))},
FUN.VALUE = numeric(3)), 1)
## free paid
## N 8288.0 5180.0
## Mean 289.5 301.2
## SD 91.6 74.9
#
# #Welch Two Sample t-test
ha_bb <- "less" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(x = free, y = paid, alternative = ha_bb)
testT_bb
##
## Welch Two Sample t-test
##
## data: free and paid
## t = -8.0797, df = 12539, p-value = 3.543e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -9.313015
## sample estimates:
## mean of x mean of y
## 289.5123 301.2061
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.
# #Contest Type Comparison for Free
bb <- xxB23Sports_Q3_2T_Free %>%
select(key = 2, value = 3) %>%
filter(key == "public" | key == "private") %>%
mutate(across(key, factor))
#
# #Sample Information
bb %>% group_by(key) %>% summarise(N = n(), Mean = round(mean(value), 1), SD = round(sd(value), 1))
## # A tibble: 2 x 4
## key N Mean SD
## <fct> <int> <dbl> <dbl>
## 1 private 566 310. 89.5
## 2 public 7710 288 91.6
#
# #Welch Two Sample t-test
ha_bb <- "two.sided" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(formula = value ~ key, data = bb, alternative = ha_bb)
testT_bb
##
## Welch Two Sample t-test
##
## data: value by key
## t = 5.6664, df = 654.91, p-value = 2.186e-08
## alternative hypothesis: true difference in means between group private and group public is not equal to 0
## 95 percent confidence interval:
## 14.45501 29.78603
## sample estimates:
## mean in group private mean in group public
## 310.0967 287.9762
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.
# #Contest Type Comparison for Paid
bb <- xxB23Sports_Q3_2T_Paid %>%
select(key = 2, value = 3) %>%
filter(key == "public" | key == "private") %>%
mutate(across(key, factor))
#
# #Sample Information
bb %>% group_by(key) %>% summarise(N = n(), Mean = round(mean(value), 1), SD = round(sd(value), 1))
## # A tibble: 2 x 4
## key N Mean SD
## <fct> <int> <dbl> <dbl>
## 1 private 147 286. 104.
## 2 public 1116 302. 62.7
#
# #Welch Two Sample t-test
ha_bb <- "two.sided" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(formula = value ~ key, data = bb, alternative = ha_bb)
testT_bb
##
## Welch Two Sample t-test
##
## data: value by key
## t = -1.8411, df = 160.22, p-value = 0.06745
## alternative hypothesis: true difference in means between group private and group public is not equal to 0
## 95 percent confidence interval:
## -33.536271 1.175356
## sample estimates:
## mean in group private mean in group public
## 285.6759 301.8564
## p-value (0.06745) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.
# #Compare p-value with alpha = 0.05
alpha <- 0.05
# #NOTE: t.test() already returns the p-value for the chosen alternative,
# #so it is compared with alpha directly for one- and two-sided tests alike
if(testT_bb$p.value >= alpha) {
cat(paste0("p-value (", round(testT_bb$p.value, 6), ") is greater than alpha (", alpha,
"). We failed to reject H0. We cannot conclude that the populations are different.\n"))
} else {
cat(paste0("p-value (", round(testT_bb$p.value, 6), ") is less than alpha (", alpha,
").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
Scores of users who use some strategy to select players, such as recent performance of players, are higher than scores of users who select players randomly. (See Sheet “Qns_4_2SampleTTest” of “B23-FantasySports.xlsx”)
30.2 \(\text{\{Left or Lower\}}\quad {H_0}: {\mu}_1 - {\mu}_2 \geq {D_0} \quad \text{vs} \quad {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
# #Team Type Comparison with correction of Typo
bb <- xxB23Sports_Q4_2T %>%
select(key = "TeamType", value = "totalpoints") %>%
mutate(across(key, str_replace, "Stratergy", "Strategy")) %>%
mutate(across(key, factor))
#
# #Sample Information
bb %>% group_by(key) %>% summarise(N = n(), Mean = round(mean(value), 1), SD = round(sd(value), 1))
## # A tibble: 2 x 4
## key N Mean SD
## <fct> <int> <dbl> <dbl>
## 1 Random 36 249. 55.1
## 2 Strategy 36 373. 42.2
#
# #Welch Two Sample t-test
ha_bb <- "less" #"two.sided" (Default), "less", "greater"
testT_bb <- t.test(formula = value ~ key, data = bb, alternative = ha_bb)
testT_bb
##
## Welch Two Sample t-test
##
## data: value by key
## t = -10.69, df = 65.563, p-value = 2.63e-16
## alternative hypothesis: true difference in means between group Random and group Strategy is less than 0
## 95 percent confidence interval:
## -Inf -104.2835
## sample estimates:
## mean in group Random mean in group Strategy
## 249.1806 372.7500
## p-value (0) is less than alpha (0.05).
## We can reject the H0 with 95% confidence. The populations are different.
If fantasy sports is a game of skill, then player performance has a major role in the player getting selected to a team as well as in the selection of captain or vice-captain. Using the data, can we test whether selection of players to a team and scoring high are dependent on each other? (See Sheet “Qns_5_2SampleTTest” of “B23-FantasySports.xlsx”)
# #Select | Sum | Long | Separate String | Wide | Relocate | Column To RowNames |
bb <- xxB23Sports_Q5_Chi_Player %>%
select(nTop_nSelect = 3, Top_nSelect = 4, nTop_Select = 5, Top_Select = 6) %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
separate(name, c("isTop", "isSelect")) %>%
pivot_wider(names_from = isSelect, values_from = value) %>%
relocate(nSelect, .after = last_col()) %>%
column_to_rownames('isTop')
bb
## Select nSelect
## nTop 1350174 109648
## Top 447240 16594
# #Chi-squared Test
chisq.test(bb)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: bb
## X-squared = 8881, df = 1, p-value < 2.2e-16
In the data supplied, a few users have selected one of the top three high-performing players as captain or vice-captain. There are also users who have not used any of these players as captain or vice-captain. Ramsu claims that choosing high-performing players as captain and/or vice-captain has an impact on the scores of the users. Test this claim made by Ramsu and link it to the business problem to make an inference. (See Sheet “Qns_7_Anova” of “B23-FantasySports.xlsx”)
# #NOTE: Because of High number of rows, data was exported to CSV and then Imported
NoSelect <- xxB23Sports_Q7_Anova_NotSelect %>% drop_na(userpoints) %>% select(userpoints)
Captain <- xxB23Sports_Q7_Anova_Captain %>% drop_na(userpoints) %>% select(userpoints)
ViceCap <- xxB23Sports_Q7_Anova_VC %>% drop_na(userpoints) %>% select(userpoints)
#
# #Merge Datasets by Rows
q2p4 <- bind_rows(NoSelect = NoSelect, Captain = Captain, ViceCap = ViceCap, .id = 'Type')
# ANOVA
anv_q2p4 <- aov(userpoints ~ Type, data = q2p4)
anv_q2p4
## Call:
## aov(formula = userpoints ~ Type, data = q2p4)
##
## Terms:
## Type Residuals
## Sum of Squares 18812927 747815344
## Deg. of Freedom 2 431507
##
## Residual standard error: 41.6297
## Estimated effects may be unbalanced
#
summary(anv_q2p4)
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 2 18812927 9406463 5428 <2e-16 ***
## Residuals 431507 747815344 1733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# #Coefficients
anv_q2p4$coefficients
## (Intercept) TypeNoSelect TypeViceCap
## 297.684103 -14.373297 -3.556425
Discuss clustering algorithms that can be used for segmenting customers of Champo Carpets. Apply both k-means and hierarchical clustering techniques and provide insights on the segments we can extract out of these data.
Note: Use Sheet “Data Order ONLY” of “B23-Champo.xlsx.” It has 13,135 records. Use only the numerical variables (e.g., quantity required, total area, and amount) for performing cluster analysis.
xw <- xxB23Champo_3_Order %>% select(Quantity = QtyRequired, Area = TotalArea, Amount)
zw <- xw %>% mutate(across(everything(), ~ as.vector(scale(.))))
str(xw)
## tibble [13,135 x 3] (S3: tbl_df/tbl/data.frame)
## $ Quantity: num [1:13135] 6 6 7 7 5 6 35 5 4 7 ...
## $ Area : num [1:13135] 128 117 88 88 117 ...
## $ Amount : num [1:13135] 770 702 616 616 585 ...
# #This is Slow.
hh <- zw
cap_hh <- "B23P01"
ttl_hh <- "Champo: Elbow Curve (WSS)"
#
# #factoextra::fviz_nbclust() generates ggplot
# #method = "wss" (for total within sum of square)
B23P01 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "wss") +
labs(caption = cap_hh, title = ttl_hh)
hh <- zw
cap_hh <- "B23P02"
ttl_hh <- "Champo: Elbow Curve (Silhouette)"
#
# #method = "silhouette" (for average silhouette width)
B23P02 <- fviz_nbclust(hh, FUNcluster = kmeans, method = "silhouette") +
labs(caption = cap_hh, title = ttl_hh)
Figure 15.1 Champo: WSS and Silhouette
# #Fix Seed
set.seed(3)
# #Cluster analysis with k = {3, 6}
k3_zw <- kmeans(zw, centers = 3)
k6_zw <- kmeans(zw, centers = 6)
#
# #Save cluster membership of each point back into the dataset
res_champo <- cbind(xw,
list(k3 = k3_zw$cluster, k6 = k6_zw$cluster)) %>% as_tibble()
# #Three Clusters
ii <- k3_zw
ii$size
## [1] 9038 4043 54
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 51.52%"
round(ii$centers, 3)
## Quantity Area Amount
## 1 0.012 -0.566 -0.063
## 2 -0.138 1.262 -0.012
## 3 8.366 0.229 11.447
# #Six Clusters
ii <- k6_zw
ii$size
## [1] 6767 2862 1052 29 28 2397
paste0("Between /Total = ", round(100 * ii$betweenss / ii$totss, 2), "%")
## [1] "Between /Total = 79.95%"
round(ii$centers, 3)
## Quantity Area Amount
## 1 0.038 -0.735 -0.074
## 2 -0.124 0.920 0.007
## 3 -0.174 2.301 -0.036
## 4 -0.095 0.897 18.960
## 5 18.077 -0.651 1.816
## 6 -0.093 -0.035 -0.033
Figure 15.2 Champo: k-means with k=3
Figure 15.3 Champo: k-means with k=6
str(zw)
## tibble [13,135 x 3] (S3: tbl_df/tbl/data.frame)
## $ Quantity: num [1:13135] -0.168 -0.168 -0.164 -0.164 -0.173 ...
## $ Area : num [1:13135] 1.667 1.442 0.864 0.864 1.442 ...
## $ Amount : num [1:13135] -0.0964 -0.1004 -0.1055 -0.1055 -0.1074 ...
#
# #Create distance matrix
dist_zw <- dist(zw)
#
hclust_com_zw <- hclust(dist_zw, method = "complete")
#hclust_avg_zw <- hclust(dist_zw, method = "average")
#hclust_sng_zw <- hclust(dist_zw, method = "single")
#
# #Cut Tree by Cluster membership
k3_com_zw <- cutree(hclust_com_zw, 3)
k4_com_zw <- cutree(hclust_com_zw, 4)
k6_com_zw <- cutree(hclust_com_zw, 6)
#
# #Save cluster membership of each point back into the dataset
hrc_champo <- cbind(xw, list(k3 = k3_com_zw, k4 = k4_com_zw, k6 = k6_com_zw)) %>% as_tibble()
#
# #Cluster Mean
if(FALSE) aggregate(zw, by = list(k3_com_zw), FUN = function(x) round(mean(x), 3))
# #Equivalent
hrc_champo %>% select(-c(k4, k6)) %>% group_by(k3) %>%
summarise(N = n(), across(everything(), mean))
## # A tibble: 3 x 5
## k3 N Quantity Area Amount
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 13075 35.5 44.7 1592.
## 2 2 31 22.5 88.4 310634.
## 3 3 29 4097. 12.9 33562.
\({Y}\): Scalar response, Dependent variable, Outcome variable, Target, Predicted
\({X}\): Explanatory variables, Independent variables, Antecedent variables, Predictors
Regression is applied when the objective is to predict the outcome variable based on the antecedent variables.
The predicted variable is continuous, but the predictors can be either continuous or categorical.
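A minimal sketch of this setup, using the built-in `mtcars` data rather than the course datasets: a continuous outcome modelled from one continuous and one categorical predictor.

```r
# Minimal sketch with built-in data: continuous outcome (mpg) predicted from
# a continuous predictor (wt) and a categorical predictor (transmission, am)
fit <- lm(mpg ~ wt + factor(am), data = mtcars)
coef(fit)  # intercept, slope for wt, and a shift for the second transmission level
```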
NOTE: At least one chunk is throwing an error, probably the vif() call, but this needs to be checked.
# #Object Names for each sheet
namesXL <- c("xxB26Hdesc", "xxB26Hraw", "xxB26Hmod")
# #Dimensions of these datasets
str(lapply(namesXL, function(x) {dim(eval(parse(text = x)))}))
## List of 3
## $ : int [1:2] 58 2
## $ : int [1:2] 248 24
## $ : int [1:2] 248 51
# #
bb <- aa <- xxB26CarDekho
str(bb)
# #To keep only the First Word of NAME Column. Warnings are Present
bb %>% separate(name, c("Brand", NA), sep = " ", remove = FALSE)
# #Select Relevant | Calculate Age
ii <- bb %>% select(-1, 4:7) %>% mutate(Age = 2021 - year)
#unique(ii$owner)
#anyNA(ii)
#
jj <- dummy_cols(ii, select_columns = c("seller_type", "fuel", "transmission", "owner"), remove_first_dummy = TRUE)
#
## Calculate AGE
## Step 1: Create the training and test data
# Create Training and Test data
set.seed(3)
train_row_ii <- sample(1:nrow(ii), 0.8*nrow(ii)) # row indices for training data: 80:20
train_ii <- ii[train_row_ii, ] # model training data
test_ii <- ii[-train_row_ii, ] # test data
#
train_row_jj <- sample(1:nrow(jj), 0.8*nrow(jj)) # row indices for training data: 80:20
train_jj <- jj[train_row_jj, ] # model training data
test_jj <- jj[-train_row_jj, ] # test data
This is a validation process: the training data is used to train the model, and the test data to check model validity. If there is high model overfitting, then trying the model on other data will show discrepancies.
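The train/test discrepancy under overfitting can be sketched on synthetic data (the toy variables and the degree-15 polynomial are illustrative assumptions, not part of the CarDekho exercise):

```r
# Illustrative sketch on synthetic data (not the CarDekho set): an overfit
# model shows a much larger gap between training and test error.
set.seed(3)
n  <- 200
dd <- data.frame(x = runif(n))
dd$y <- 3 * dd$x + rnorm(n)
idx      <- sample(seq_len(n), 0.8 * n)
train_dd <- dd[idx, ]
test_dd  <- dd[-idx, ]
fit_ok  <- lm(y ~ x, data = train_dd)            # sensible model
fit_bad <- lm(y ~ poly(x, 15), data = train_dd)  # over-flexible model
rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, newdata = d))^2))
round(c(ok_train  = rmse(fit_ok,  train_dd), ok_test  = rmse(fit_ok,  test_dd),
        bad_train = rmse(fit_bad, train_dd), bad_test = rmse(fit_bad, test_dd)), 3)
# Typically bad_train < ok_train while bad_test is no better or worse:
# the extra flexibility fits noise in the training split
```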
##Step 2: Fit the model on training data and predict selling_price
# Build the model on training data
# #Exclude the 14th column ("fuel_Electric"): it has only 1 observation in the original dataset, and that observation went to the test set, not the training set
which(names(train_jj) == "fuel_Electric")
#model1_ii <- lm(selling_price ~ ., data = train_ii[,-4]) # build the model
model1_jj <- lm(selling_price ~ ., data = train_jj[, -14]) # build the model
#
names(model1_jj)[which(!names(model1_jj) %in% c("residuals", "effects", "fitted.values", "qr", "model"))]
# #
## [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
## [8] "df.residual" "contrasts" "xlevels" "call" "terms" "model"
#summary(model1_jj)
summary(model1_jj)$coefficients
#
# #We lose the stars doing this and the function needs to show 0 for very low values
f_pNum(summary(model1_jj)$coefficients)
#
#options(scipen=10)
#format(model1_jj$coefficients, scientific = FALSE)
#format(model1_jj$contrasts, scientific = FALSE)
Can we tell the model that at least two observations of each dummy variable's levels should be included in the training dataset, or would that be considered a biased model? Do this for a categorical Y, not a continuous Y. If X is gender and, out of 1000 observations, 999 are male, can we do anything with gender? The data itself is telling us that gender is overall male; look at the significance.
#model1_jj
#library(car)
# while making the model, please assure that there is no multicollinearity
# it means the IVs are not correlated in a high fashion.
#library (psych)
pairs.panels(train_jj[,-1], cex = 4)
vif(model1_jj)
#vif(model1)
# #vif stands for variance inflation factor, a measure of multicollinearity:
# #if any variable carries a VIF greater than 5, there is a multicollinearity issue
# #this does not work because we have dummy variables
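What the VIF measures can be sketched by hand on the built-in `mtcars` data (an illustration of the 1/(1 − R²) definition, not the CarDekho model):

```r
# What vif() measures, by hand: regress each predictor on the remaining
# predictors; VIF = 1 / (1 - R^2) of that auxiliary regression
preds <- c("disp", "hp", "wt")
vif_by_hand <- sapply(preds, function(v) {
  r2 <- summary(lm(reformulate(setdiff(preds, v), response = v),
                   data = mtcars))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
# Values above ~5 are commonly read as a multicollinearity warning sign
```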
#find the linearly dependent variables
ld.vars <- attributes(alias(model1_jj)$Complete)$dimnames[[1]]
#remove the linearly dependent variables from the formula
formula.new <- as.formula(
  paste(
    paste(deparse(formula(model1_jj)), collapse = ""),
    paste(ld.vars, collapse = " - "),
    sep = " - "
  )
)
#run the model again on the same training data
fit.new <- lm(formula.new, data = train_jj)
vif(fit.new)
Non-Normal
Normal now (the professor got a p-value less than 0.05)
aa <- xxB26KC
#
names(aa)
bb <- aa %>% drop_na() %>%
  select(-c(id, view, zipcode, lat, long)) %>%
  mutate(Sold = year(date), Age = Sold - yr_built) %>% relocate(Age, Sold) %>%
  mutate(isRenew = ifelse(yr_renovated == 0, 0, 1)) %>% relocate(isRenew) %>%
  rename(Beds = bedrooms, Baths = bathrooms, sqLiv = sqft_living, sqLot = sqft_lot) %>%
  select(-date, -Sold, -yr_renovated) %>% relocate(price)
if(FALSE) str(bb)
if(FALSE) summary(bb)
if(TRUE) head(bb)
kc_zsyw <- bb %>% mutate(across(where(is.numeric), ~ as.vector(scale(.))))
f_wl(kc_zsyw)
hh <- cor(kc_zsyw)
cap_hh <- paste0("Correlation Matrix")
f_pKblM(x = hh, caption = cap_hh, negPos = c(-0.5, 0.5), dig = 3, debug = TRUE)
# #IN: hh(Keys, Values),
C34 <- hh %>% { ggplot(data = ., mapping = aes(x = Keys, y = Values, fill = Keys)) +
geom_boxplot() +
k_gglayer_box +
scale_y_continuous(breaks = breaks_pretty()) +
coord_flip() +
theme(legend.position = 'none') +
labs(x = NULL, y = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
C34P01
# #Remove anything beyond 2 SD of Price
ii <- bb %>% filter(!(abs(price - median(price)) > 2*sd(price)))
# #Remove anything beyond 2.5 Scaled of Price
jj <- bb %>% mutate(zPrice = scale(price)) %>% relocate(zPrice) %>% filter(between(zPrice, -2.5, +2.5))
ii <- ii %>% select(-sqft_basement)
index_ii <- sample(1:nrow(ii), .80*nrow(ii))
train <- ii[index_ii,]
test <- ii[-index_ii,]
#
str(train)
# #Run the linear regression model
model1 <- lm(price ~ ., data = train)
summary(model1)
# #Get Prediction
predicted <- predict(model1, newdata = test) # predict the test data
table1 <- data.frame(Actual = test$price, Predicted = predicted) # create a table with actual and predicted test values
mape_test <- mean(abs(table1$Actual - table1$Predicted) / table1$Actual)
accuracy_test <- 1 - mape_test
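The accuracy = 1 − MAPE step above is repeated for each model in this section; a small helper (a hypothetical name, not in the original notes) captures the pattern once:

```r
# Hypothetical helper (not in the original notes): accuracy as 1 - MAPE
f_mape_accuracy <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  1 - mean(abs(actual - predicted) / actual)
}
f_mape_accuracy(c(100, 200), c(110, 190))  # 1 - mean(0.10, 0.05) = 0.925
```

Note that MAPE is undefined when an actual value is zero, hence the guard in the helper.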
accuracy_test
# #Train
#custom control parameters
custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
#
ridge <- train(price ~ ., train, method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 5)),
trControl = custom)
# #
names(ridge)
predicted_ridge <- predict(ridge, newdata = test) # predict the test data
table_ridge <- data.frame(Actual = test$price, Predicted = predicted_ridge) # create a table with actual and predicted test values
str(table_ridge)
#
#
mape_ridge <- mean(abs(table_ridge$Actual - table_ridge$Predicted) / table_ridge$Actual)
accuracy_ridge <- 1 - mape_ridge
accuracy_ridge #73.3%
custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
lasso <- train(price ~ ., train, method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 5)),
trControl = custom)
#
predicted_lasso <- predict(lasso, newdata = test) # predict the test data
table_lasso <- data.frame(Actual = test$price, Predicted = predicted_lasso) # create a table with actual and predicted test values
str(table_lasso)
mape_lasso <- mean(abs(table_lasso$Actual - table_lasso$Predicted) / table_lasso$Actual)
accuracy_lasso <- 1 - mape_lasso
accuracy_lasso #73.7%
accuracy_lasso
accuracy_ridge
accuracy_test
accuracy_els
custom <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
#lasso <- train (price ~., train, method = "glmnet",
#tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 5)),
#trControl = custom)
elastic <- train(price ~ ., train, method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0, 1, length = 10),
lambda = seq(0.0001, 1, length = 5)), trControl = custom)
predicted_els <- predict(elastic, newdata = test) # predict the test data
table_els <- data.frame(Actual = test$price, Predicted = predicted_els) # create a table with actual and predicted test values
str(table_els)
mape_els <- mean(abs(table_els$Actual - table_els$Predicted) / table_els$Actual)
accuracy_els <- 1 - mape_els
accuracy_els #73.7% 0.737696
DATASET_2: This dataset provides features (related to demographics and buying behaviour) that are very relevant to predicting an auto insurance company’s customer lifetime value (CLV). For more details about the features, see the dataset. Using the given dataset, take customer lifetime value as the target/dependent variable, use the relevant features provided in the data as the independent variables, and develop predictive models. Considering the given situation as a regression problem, execute linear regression to predict the target variable. Use the results and provide necessary recommendations.
Definitions and Exercises are from the Book (David R. Anderson 2018)
Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.
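This observations-times-variables count can be checked directly with a built-in data frame:

```r
# Quick check with a built-in data frame: data items = observations x variables
dim(mtcars)                  # 32 observations, 11 variables
nrow(mtcars) * ncol(mtcars)  # 352 data items
prod(dim(mtcars))            # same result
```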
For example, Gender as Male and Female. Where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used: for example, 1 denotes Male, 2 denotes Female. The scale of measurement is still nominal even though the data appear as numerical values. Only the Mode can be calculated.
For example, Size as small, medium, large. Along with the labels, similar to nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. Median can be calculated but not the Mean.
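The two scales above map naturally onto R's plain and ordered factors (the gender and size values are made-up illustrations):

```r
# Nominal: numeric codes are labels only; the mode is the most frequent category
gender <- factor(c(1, 2, 1, 1, 2), labels = c("Male", "Female"))
names(which.max(table(gender)))          # mode: "Male"
# Ordinal: an ordered factor; the median rank is meaningful, the mean is not
size <- factor(c("small", "large", "medium", "small", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
levels(size)[median(as.integer(size))]   # median category: "medium"
```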
Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.
Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.
For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of 30,000 dollars for one automobile to the cost of 15,000 dollars for a second automobile, the ratio property shows that the first automobile is 30000/15000 = 2 times, or twice, the cost of the second automobile.
See Table 21.1 for more details.
Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point, in contrast with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit” we see that negative profit does make sense to us. So while the 0 for “profit” is a meaningful reference point (just like the 0 of Celsius temperature measurements), it is arbitrary. Therefore, profit is on an interval scale of measurement.
In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).
In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).
Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.
| Features | Interval scale | Ratio scale |
|---|---|---|
| Variable property | Addition and subtraction | Multiplication and division, i.e. you can calculate ratios and compare numbers on the scale against a true 0. |
| Absolute Point Zero | Zero-point in an interval scale is arbitrary. For example, the temperature can be below 0 °C and into negative temperatures. | The ratio scale has an absolute zero or character of origin. Height and weight cannot be zero or below zero. |
| Calculation | Statistically, in an interval scale, the Arithmetic Mean is calculated. Statistical dispersion permits range and standard deviation. The coefficient of variation is not permitted. | Statistically, in a ratio scale, the Geometric or Harmonic mean is calculated. Also, range and coefficient of variation are permitted for measuring statistical dispersion. |
| Measurement | Interval scale can measure size and magnitude as multiple factors of a defined unit. | Ratio scale can measure size and magnitude as a factor of one defined unit in terms of another. |
| Example | Temperature in Celsius, Calendar years and time, Profit | These possess an absolute zero characteristic, like age, weight, height, or Sales |
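The Celsius and weight examples above can be written out directly (toy numbers for illustration):

```r
# Differences are meaningful on both scales; ratios only on a ratio scale
celsius <- c(20, 40)
diff(celsius)            # a 20 degree difference: meaningful
celsius[2] / celsius[1]  # 2, but 40 C is NOT "twice as hot" (interval scale)
kg <- c(20, 40)
kg[2] / kg[1]            # 2: genuinely twice as heavy (ratio scale, true zero)
```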
If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.
Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.
Quantitative data may be discrete or continuous.
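The contrast between categorical summaries (counts, proportions) and quantitative arithmetic (means) can be shown in a few lines (the fuel and mileage values are made-up illustrations):

```r
# Categorical data: counts and proportions are the meaningful summaries
fuel <- c("Petrol", "Diesel", "Petrol", "Petrol")
table(fuel)              # Diesel 1, Petrol 3
prop.table(table(fuel))  # Diesel 0.25, Petrol 0.75
# Quantitative data: arithmetic such as the mean gives meaningful results
mileage <- c(12.1, 18.4, 15.1)
mean(mileage)            # 15.2
```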
As a result, the data obtained from a well-designed experiment can often provide more information as compared to the data obtained from existing sources or by conducting an observational study.
Refer Sample For More …
The population is the set of entities under study.
Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.
Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.
Inferential statistics are used for Hypothesis Testing.
The two most common types of statistical inference are confidence intervals and tests of significance.
Reasoning for Tests of Significance
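The sample-to-population reasoning above can be sketched on simulated data (the population mean and standard deviation are assumed values, chosen only so the "true" answer is known):

```r
# Sketch of statistical inference on simulated data: a sample estimates the
# population mean, with a precision (confidence) statement attached
set.seed(3)
population <- rnorm(100000, mean = 50, sd = 10)  # "true" mean known here: 50
smpl <- sample(population, 200)
mean(smpl)                                # point estimate of the population mean
t.test(smpl, conf.level = 0.95)$conf.int  # 95% CI: usually covers the true 50
```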
Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.
Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dash boards, and basic what-if spreadsheet models.
Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.
Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.
Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.
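A toy prescriptive model makes this concrete (the linear demand curve q = 100 − 2p and the price bounds are illustrative assumptions):

```r
# Toy prescriptive model: choose the price p that maximizes revenue p * q,
# given an assumed demand curve q = 100 - 2p and the constraint 0 <= p <= 50
revenue <- function(p) p * (100 - 2 * p)
best <- optimize(revenue, interval = c(0, 50), maximum = TRUE)
best$maximum    # best decision: price of about 25
best$objective  # resulting revenue of about 1250
```

The output is not a description or a forecast but a recommended decision, which is what distinguishes prescriptive analytics.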
Volume refers to the amount of available data; velocity refers to the speed at which data is collected and processed; and variety refers to the different data types. The term data warehousing is used to refer to the process of capturing, storing, and maintaining the data.
Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.
cor(), lm() etc.
xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")
f_getObject <- function(x_name, x_source, x_md = "") {
# #Debugging
a07bug <- FALSE
# #Read the File or Object
# #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
# #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
#
# #Path to the File
loc_src <- paste0(.z$XL, x_source)
# #Path to the Object
loc_rds <- paste0(.z$XL, x_name, ".rds")
#
# #x_file[1] FILENAME & x_file[2] FILETYPE
x_file <- strsplit(x_source, "[.]")[[1]]
#
if(all(x_md == tools::md5sum(loc_src), file.exists(loc_rds),
file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
# #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
if(a07bug) print("A07 Flag 01: Reading from RDS")
return(readRDS(loc_rds))
} else if(!file.exists(loc_src)){
message("ERROR: File does not exist! : ", loc_src, "\n")
stop()
} else if(x_file[2] == "csv") {
# #Read CSV as a Tibble
# #col_double(), col_character(), col_logical(), col_integer()
# #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
tbl <- read_csv(loc_src, show_col_types = FALSE)
# #Remove Unnecessary Attributes
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
return(tbl)
} else if(x_file[2] == "xlsx") {
# #Read All Sheets of Excel in a list
tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
return(tbl)
} else {
message("f_getObject(): UNKNOWN")
stop()
}
}
bb <- xxComputers
#displ_names <- c("")
#stopifnot(identical(ncol(bb), length(displ_names)))
#
# #Kable Table
kbl(bb,
caption = "(C01T01)",
#col.names = displ_names,
escape = FALSE, align = "c", booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
html_font = "Consolas", font_size = 12,
full_width = FALSE,
#position = "float_left",
fixed_thead = TRUE
) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
row_spec(0, color = "white", background = "#303030", bold = TRUE,
extra_css = "border-bottom: 1px solid; border-top: 1px solid"
)
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
             `2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
             `2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
## Company Year Values
## <chr> <chr> <dbl>
## 1 Hertz 2007 327
## 2 Hertz 2008 311
## 3 Hertz 2009 286
## 4 Hertz 2010 290
## 5 Dollar 2007 167
## 6 Dollar 2008 140
## 7 Dollar 2009 106
## 8 Dollar 2010 108
## 9 Avis 2007 204
## 10 Avis 2008 220
## 11 Avis 2009 300
## 12 Avis 2010 270
# #Transpose
(ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Equivalent
stopifnot(identical(ii,
bb %>% pivot_longer(-1) %>%
pivot_wider(names_from = 1, values_from = value) %>%
rename(., Year = name)))

| SN | tablet | cost | os | display_inch | battery_hh | cpu |
|---|---|---|---|---|---|---|
| 1 | Acer Iconia W510 | 599 | Windows | 10.1 | 8.5 | Intel |
| 2 | Amazon Kindle Fire HD | 299 | Android | 8.9 | 9.0 | TI OMAP |
| 3 | Apple iPad 4 | 499 | iOS | 9.7 | 11.0 | Apple |
| 4 | HP Envy X2 | 860 | Windows | 11.6 | 8.0 | Intel |
| 5 | Lenovo ThinkPad Tablet | 668 | Windows | 10.1 | 10.5 | Intel |
| 6 | Microsoft Surface Pro | 899 | Windows | 10.6 | 4.0 | Intel |
| 7 | Motorola Droid XYboard | 530 | Android | 10.1 | 9.0 | TI OMAP |
| 8 | Samsung Ativ Smart PC | 590 | Windows | 11.6 | 7.0 | Intel |
| 9 | Samsung Galaxy Tab | 525 | Android | 10.1 | 10.0 | Nvidia |
| 10 | Sony Tablet S | 360 | Android | 9.4 | 8.0 | Nvidia |
# #What is the average cost for the tablets #$582.90
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows /Android) #$723.20 $428.5
(ii <- bb %>%
group_by(os) %>%
summarise(Mean = round(mean(cost), digits =1)) %>%
arrange(desc(Mean)) %>%
mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
## os Mean
## <chr> <chr>
## 1 Windows $723.2
## 2 iOS $499
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ",
ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2
# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
group_by(os) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
## os PCT
## <chr> <chr>
## 1 Windows 50%
## 2 Android 40%
## 3 iOS 10%
#
cat(paste0("Android OS is used in ",
ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets

| SN | brand | model | price | overall_score | voice_quality | handset_on_base | talk_time_hh |
|---|---|---|---|---|---|---|---|
| 1 | AT&T | CL84100 | 60 | 73 | Excellent | Yes | 7 |
| 2 | AT&T | TL92271 | 80 | 70 | Very Good | No | 7 |
| 3 | Panasonic | 4773B | 100 | 78 | Very Good | Yes | 13 |
| 4 | Panasonic | 6592T | 70 | 72 | Very Good | No | 13 |
| 5 | Uniden | D2997 | 45 | 70 | Very Good | No | 10 |
| 6 | Uniden | D1788 | 80 | 73 | Very Good | Yes | 7 |
| 7 | Vtech | DS6521 | 60 | 72 | Excellent | No | 7 |
| 8 | Vtech | CS6649 | 50 | 72 | Very Good | Yes | 7 |
# #What is the average price for the cordless telephones
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours
# #What percentage of the cordless telephones have a voice quality of excellent
(hh <- bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
mutate(voice_quality = factor(voice_quality,
levels = c("Very Good", "Excellent"), ordered = TRUE)) %>%
arrange(desc(voice_quality)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
## voice_quality PCT
## <ord> <chr>
## 1 Excellent 25%
## 2 Very Good 75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ",
hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
## voice_quality PCT
## <chr> <dbl>
## 1 Excellent 25
## 2 Very Good 75
# #What percentage of the cordless telephones have a handset on the base
bb %>%
group_by(handset_on_base) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")) %>%
filter(handset_on_base == "Yes")
## # A tibble: 1 x 2
## handset_on_base PCT
## <chr> <chr>
## 1 Yes 50%
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
`2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
`2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values)
loc_png <- paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png")
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01", "-Cars-TimeSeries", ".png"))
Figure 21.1 Multiple Time Series Graph
# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share
ii %>%
rowwise() %>%
mutate(SUM = sum(c_across(where(is.numeric)))) %>%
ungroup() %>%
mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
## Year Hertz Dollar Avis SUM Hertz.1 Dollar.1 Avis.1
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 2007 327 167 204 698 46.8% 23.9% 29.2%
## 2 2008 311 140 220 671 46.3% 20.9% 32.8%
## 3 2009 286 106 300 692 41.3% 15.3% 43.4%
## 4 2010 290 108 270 668 43.4% 16.2% 40.4%
# #Bar Plot
aa <- bb %>%
select(Company, `2010`) %>%
rename("Y2010" = `2010`) %>%
arrange(desc(.[2])) %>%
mutate(cSUM = cumsum(Y2010)) %>%
mutate(PCT = 100 * Y2010 / sum(Y2010)) %>%
mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>%
mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Save without using ggsave()
hh <- pareto_chr
loc_png <- paste0(.z$PX, "C01P02", "-Cars-Pareto", ".png")
cap_hh <- "C01P02"
#
if(!file.exists(loc_png)) {
png(filename = loc_png)
#dev.control('enable')
pareto.chart(hh, xlab = "Company", ylab = "Cars", cumperc = seq(0, 100, by = 20),
ylab2 = "Cumulative Percentage", main = "Pareto Chart")
#title(main = ttl_hh, line = 2, adj = 0)
title(sub = cap_hh, line = 4, adj = 1)
C01P02 <- recordPlot()
dev.off()
}
Figure 21.2 Pareto of Cars in 2010
# #Summarised Packages and Objects
f_()
## [1] "color_base, color_sort, color_uniq, fruit_base, fruit_sort, fruit_uniq"
#
difftime(Sys.time(), k_start)
## Time difference of 58.07009 secs
The relative frequency of a class equals the fraction or proportion of observations belonging to that class, i.e. it is out of 1, whereas ‘percent frequency’ is out of 100%.
Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
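The definitions above can be sketched with table(), prop.table(), and cumsum() on made-up categorical data:

```r
# #Frequency, relative, percent, and cumulative frequency (made-up data)
x <- c("A", "B", "A", "C", "B", "A")
freq  <- table(x)          # #Frequency of each class
rel   <- prop.table(freq)  # #Relative frequency: out of 1
pct   <- 100 * rel         # #Percent frequency: out of 100
cfreq <- cumsum(freq)      # #Cumulative frequency
```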
ggplot() does not allow easy setup of a dual axis.

| softdrink | Frequency | cSUM | PROP | PCT | cPCT |
|---|---|---|---|---|---|
| Coca-Cola | 19 | 19 | 38 | 38% | 38% |
| Pepsi | 13 | 32 | 26 | 26% | 64% |
| Diet Coke | 8 | 40 | 16 | 16% | 80% |
| Dr. Pepper | 5 | 45 | 10 | 10% | 90% |
| Sprite | 5 | 50 | 10 | 10% | 100% |
Figure 22.1 Bar Chart and Pie Chart of Frequency
# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"),
Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>%
arrange(desc(Frequency)) %>%
mutate(cSUM = cumsum(Frequency)) %>%
mutate(PROP = 100 * Frequency / sum(Frequency)) %>%
mutate(PCT = paste0(PROP, "%")) %>%
mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))
# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
ggplot(data = ., aes(x = softdrink, y = Frequency)) +
geom_bar(stat = 'identity', aes(fill = softdrink)) +
scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages",
labels = function(b) { paste0(round(b, 0), "%")})) +
geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2,
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_bar +
labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL,
caption = "C02P01", title = "Bar Chart of Categorical Data")
}
# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")),
position = position_stack(vjust = 0.5),
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_pie +
labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}
f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme
theme_bw(base_size = base_size) %+replace%
theme(
# #The whole figure
plot.title = element_text(size = rel(1), face = "bold",
margin = margin(0,0,5,0), hjust = 0),
# #Area where the graph is located
panel.grid.minor = element_blank(),
panel.border = element_blank(),
# #The axes
axis.title = element_text(size = rel(0.85), face = "bold"),
axis.text = element_text(size = rel(0.70), face = "bold"),
# arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
axis.line = element_line(color = "black"),
# The legend
legend.title = element_text(size = rel(0.85), face = "bold"),
legend.text = element_text(size = rel(0.70), face = "bold"),
legend.key = element_rect(fill = "transparent", colour = NA),
legend.key.size = unit(1.5, "lines"),
legend.background = element_rect(fill = "transparent", colour = NA),
# Labels in the case of facetting
strip.background = element_rect(fill = "#17252D", color = "#17252D"),
strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
)
}
# #Change default ggplot2 theme
theme_set(f_theme_gg())
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
scale_fill_viridis_d(),
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1),
legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
scale_fill_viridis_d(),
#theme_void(),
theme(#panel.background = element_rect(fill = "white", colour = "white"),
#plot.background = element_rect(fill = "white",colour = "white"),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
#panel.border = element_rect(colour = "black", fill=NA, size=1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
scale_fill_viridis_c(direction = -1, alpha = 0.9),
theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
panel.border = element_blank(), axis.ticks.y = element_blank(),
legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.minor = element_blank(),
panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(), panel.grid.major.x = element_blank(),
#plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
#text=element_text(family = "Tahoma"),
#axis.title = element_text(face="bold"),
#axis.text.x = element_text(colour="black", size = 11),
#axis.text.y = element_text(colour="black", size = 9),
axis.line = element_line(size=0.5, colour = "black"))
)
#Solution: Use geom_bar(stat = "identity")
A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables.
The three steps necessary to define the classes for a frequency distribution with quantitative data are: (1) determine the number of nonoverlapping classes, (2) determine the width of each class, and (3) determine the class limits.
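A minimal sketch of defining classes for quantitative data (the numbers and the choice of 5 classes are invented for illustration):

```r
# #Three steps for classes of quantitative data (invented numbers)
x <- c(12, 14, 19, 18, 15, 15, 18, 17, 20, 27, 22, 23, 22, 21, 33, 28, 14, 18, 16, 13)
k <- 5                                                  # #1. Number of nonoverlapping classes
width <- ceiling((max(x) - min(x)) / k)                 # #2. Approximate class width, rounded up
breaks <- seq(min(x), by = width, length.out = k + 1)   # #3. Class limits
table(cut(x, breaks, include.lowest = TRUE))
```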
set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000))
# #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
ggplot(data = ., aes(x = aa, fill = ..count..)) +
geom_histogram(bins = 50, position = "identity") +
k_gglayer_hist +
labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark = ",")),
caption = "C02P03", title = "Histogram")
}
Figure 22.2 geom_histogram(): Histogram
# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48,
1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17,
38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)
bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
ggplot(., aes(x = aa)) +
geom_dotplot(binwidth = 5, method = "histodot") +
theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(), axis.title.y = element_blank()) +
labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.",
caption = "C02P04", title = "Dot Plot")
}
Figure 22.3 geom_dotplot(): Frequency Dot Chart
# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstab for both Judges in different types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal "),
Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal "),
Upheld = c(29, 100), Reversed = c(3, 18))
# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
## Verdict_Judge Abel Ken
## <chr> <dbl> <dbl>
## 1 Upheld 129 110
## 2 Reversed 21 15
# #Testing for Reverse
ii <- bb %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(aa, ii))
bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge" "Verdict"
#
# #Split on Dot
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))
jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class" "row.names" "names" "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))
# #(Deprecated) Issues:
# #(1) bind_rows() needs two dataframes. Thus, first can be skipped in Pipe, But...
# #The second dataframe cannot be replaced with dot (.), it has to have a name
# #(2) Pipe usage inside function call was working but was a concern
# #(3) It introduced NA for that replace was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, ~ replace(., . %in% NA, "Total")))
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[, 2]), Reversed = sum(.[, 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the Multiple call to sum(). However, it needs First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")),
summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36
# # USE '%in%' for NA, otherwise '==' works
bb <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum)))
#
ii <- bb %>% mutate(across(1, ~ replace(., . %in% NA, "Total")))
mm <- bb %>% mutate(across(1, ~ replace(., is.na(.), "Total")))
jj <- bb %>% mutate(Judge_Verdict = replace(Judge_Verdict, is.na(Judge_Verdict), "Total"))
kk <- bb %>% mutate(across(1, coalesce, "Total"))
ll <- bb %>% mutate(across(1, ~ replace_na(.x, "Total")))
nn <- bb %>% mutate(across(1, replace_na, "Total"))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll),
identical(ii, mm), identical(ii, nn)))
#
# #Replace NA in a Factor
bb %>%
mutate(Judge_Verdict = factor(Judge_Verdict)) %>%
mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36
# #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble( )
# #The above command will be the setup to generate this tibble anywhere
ex27 <- tibble(Observation = 1:30,
x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C",
"B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"),
y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2,
2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2))
bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
## $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
## $ x : chr [1:30] "A" "B" "B" "C" ...
## $ y : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>%
count(x, y) %>%
pivot_wider(names_from = y, values_from = n, values_fill = 0)
bb
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <int> <int>
## 1 A 5 0
## 2 B 11 2
## 3 C 2 10
# #Rowwise Percentage in Separate New Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" ))
## # A tibble: 3 x 7
## x `1` `2` SUM `1_Row` `2_Row` SUM_Row
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A 5 0 5 100 0 100
## 2 B 11 2 13 84.6 15.4 100
## 3 C 2 10 12 16.7 83.3 100
#
# #Rowwise Percentage in Same Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Equivalent
bb %>%
mutate(SUM = rowSums(across(where(is.numeric))),
across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Columnwise Percentage in Separate New Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
## x `1` `2` `1_Col` `2_Col`
## <chr> <int> <int> <dbl> <dbl>
## 1 A 5 0 27.8 0
## 2 B 11 2 61.1 16.7
## 3 C 2 10 11.1 83.3
# #Columnwise Percentage in Same Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <dbl> <dbl>
## 1 A 27.8 0
## 2 B 61.1 16.7
## 3 C 11.1 83.3
ex28 <- tibble(Observation = 1:20,
x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43),
y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))
bb <- ex28
# #Round down to the nearest 10 below the min and up to the nearest 10 above the max
nn <- 10L
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format of [10-29]
lab_x <- paste0(n_x, "-", n_x + 20 - 1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y + 20 - 1) %>% head(-1)
# #Wider Table without Totals
ii <- bb %>%
mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>%
count(x_bins, y_bins) %>%
pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
## x_bins `20-39` `40-59` `60-79` `80-99`
## <fct> <int> <int> <int> <int>
## 1 10-29 0 0 1 4
## 2 30-49 2 0 4 0
## 3 50-69 1 3 1 0
## 4 70-89 4 0 0 0
# #Cross Tab with Total Column and Total Row
jj <- ii %>%
bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, fct_explicit_na, na_level = "Total")) %>%
mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
## x_bins `20-39` `40-59` `60-79` `80-99` SUM
## <fct> <int> <int> <int> <int> <dbl>
## 1 10-29 0 0 1 4 5
## 2 30-49 2 0 4 0 6
## 3 50-69 1 3 1 0 5
## 4 70-89 4 0 0 0 4
## 5 Total 7 3 6 4 20
# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions do not
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>%
pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #By default, it excludes the lower range, but it can be included by option
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>%
pull(x_bins) %>% levels()
## [1] "[13,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>%
pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot::cut_width() makes groups of width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>%
pull(x_bins) %>% levels()
## [1] "[5,15]" "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>%
pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)
bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>%
pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"
xxCommercials <- tibble(Week = 1:10,
Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2),
Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)
Figure 22.4 geom_point(), geom_smooth(), & stat_poly_eq()
bb <- xxCommercials
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot and Trendline, along with its equation and R2 value
C02P05 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_point() +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = paste0("Trendline equation and R", '\u00b2', " value"),
caption = "C02P05", title = "Scatter Plot")
}
21.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’
As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu–Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.
Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.
Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.
Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)
The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas other conventions start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)
The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\) and their additive inverses (the negative integers). Thus \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.
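As an aside in R, integers and doubles are distinct storage modes; the largest representable integer follows from 32-bit storage (base R facts, shown as a sketch):

```r
# #Integer vs double storage in R
typeof(5L)             # "integer": the L suffix requests integer storage
typeof(5)              # "double": unsuffixed numerals are doubles
.Machine$integer.max   # 2147483647, i.e. 2^31 - 1
```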
Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \left\{ \frac{p}{q} : p, q \in \mathbb{Z}, q \neq 0 \right\}\)
A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)
The complex numbers \(\mathbb{C}\) contain numbers which are expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. These have two components the real numbers and a specific element denoted by \({i}\) (imaginary unit) which satisfies the equation \(i^2 = −1\).
The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of the circumference of a circle to its diameter.
\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{23.1}\]
\[e^{i\varphi}=\cos \varphi + i\sin \varphi \tag{23.2}\]
\[e^{i\pi} + 1 = 0 \tag{23.3}\]
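Equations (23.1) and (23.3) can be checked numerically in base R; integrate() copes with the integrable endpoint singularities, and complex literals let us evaluate Euler's identity directly:

```r
# #Numerical check of Eq. (23.1): the integral evaluates to Pi
ii <- integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)
abs(ii$value - pi)       # tiny numerical error
# #Check of Euler's identity, Eq. (23.3), with a complex literal
abs(exp(1i * pi) + 1)    # zero up to floating-point error
```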
# #Read OEIS File for 20000 PI digits including the integer part (3) and fractional part (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"),
delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)
Euler's number \(e = 2.71828\ldots\) is the base of the natural logarithm.
\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{23.4}\]
Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.
\[\varphi^2 - \varphi - 1 = 0 \\ \varphi = \frac{1 + \sqrt{5}}{2} \tag{23.5}\]
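A quick numerical check of equation (23.5), with an assumed floating-point tolerance:

```r
# The golden ratio satisfies phi^2 - phi - 1 = 0
phi <- (1 + sqrt(5)) / 2
stopifnot(abs(phi - 1.618034) < 1e-6)
stopifnot(abs(phi^2 - phi - 1) < 1e-12)
```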
Mersenne primes (primes of the form \(2^n - 1\), e.g. 8191, 131071, 524287, 2147483647) appear as test values below:
# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
## [1] NA NA NA NA NA NA NA NA NA NA
f_isPrime <- function(x) {
# #Check if the number is Prime
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647L) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)
# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
53, 59, 61, 67, 71, 73, 79, 83, 89, 97) |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)
# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
#bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
#print(aa, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
aa <- 2147483647L
#print(aa)
}
f_getPrimeUpto <- function(x){
# #Get a Vector of Primes upto the given Number (Max. 524287)
if(x < 2) {
print("NOT ALLOWED!")
return(NULL)
} else if(x > 524287){
print("Sadly, beyond this number it is difficult to generate the List of Primes!")
return(NULL)
}
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
return(y)
}
# #Compare any number of functions
result <- microbenchmark(
sum(1:100)/length(1:100),
mean(1:100),
#times = 1000,
check = 'identical'
)
# #Print Table
print(result)
##Unit: microseconds
## expr min lq mean median uq max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005 7.501 100 a
## mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001 100 b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()
#
# #Or Compare Plot Rendering
if(FALSE) microbenchmark(print(jj), print(kk), print(ll), times = 2)
“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this comparison as well.
# #Conclusion: prefer mean(), because precision is harder to achieve than speed
#
# #sum()/length() is faster than mean()
# #However, mean() makes a second pass over the data, so it is more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero,
# #and the mean of such a vector will equal the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
sum(aa)/length(aa),
mean(aa),
mean.default(aa),
.Internal(mean(aa)),
#times = 1000,
check = 'identical'
)
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## sum(aa)/length(aa) 400 500 588 500 600 3900 100 a
## mean(aa) 4000 4250 4816 4500 5100 23600 100 c
## mean.default(aa) 1400 1500 1788 1600 1800 10900 100 b
## .Internal(mean(aa)) 500 550 630 600 600 5300 100 a
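The accuracy half of this speed/precision trade-off can be sketched with made-up numbers: a one-pass variance formula, sum(x^2)/n minus the squared mean, suffers catastrophic cancellation when values share a large common offset, while var(), which subtracts the mean first (a second pass), stays accurate:

```r
# Illustrative data: tiny spread around a huge offset (values are made up)
aa <- rep(1e8, 5) + c(-0.2, -0.1, 0, 0.1, 0.2)
# One-pass sample variance: numerically unstable here
one_pass <- (sum(aa^2) - sum(aa)^2 / length(aa)) / (length(aa) - 1)
# Two-pass sample variance: R's var() subtracts the mean before squaring
two_pass <- var(aa)
print(c(one_pass = one_pass, two_pass = two_pass))
# True sample variance is 0.1/4 = 0.025; only the two-pass version recovers it
stopifnot(abs(two_pass - 0.025) < 1e-6)
```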
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
## num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() is NOT matching
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 2.355429e-17
if(FALSE) {
# #Remove all objects matching a pattern
rm(list = ls(pattern = "f_"))
}
# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
# #Change Value
# #NOTE it did not help when recursive function failed
# #Error: node stack overflow
# #Error during wrapup: node stack overflow
# #Error: no more error handlers available ...
options(expressions=10000)
}
# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)
# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)
# #To Profile Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimeUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")
# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
# #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
if (x == 2L) {
return(TRUE)
} else if (any(x %% 2:(x-1) == 0)) {
return(FALSE)
} else return(TRUE)
}
isPrime_b <- function(x){
# #Comparison of Division and Integer Division by 1, 2, ..., x
# #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
# #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
# vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
if(sum(x / 1:x == x %/% 1:x) == 2) {
return(TRUE)
} else return(FALSE)
}
isPrime_c <- function(x) {
# #RegEx is the slowest: it converts -ve values and coerces non-integers, which may result in bugs
x <- abs(as.integer(x))
if(x > 8191L) {
print("Do not run this with large values. RegEx is really slow.")
stop()
}
!grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}
isPrime_d <- function(x) {
# #Fails for "1" & returns TRUE
# #Fails for "-ve and zero" Error: NA/NaN argument
if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
return(TRUE)
} else return(FALSE)
}
isPrime_e <- function(x) {
# #Fails for "-ve and zero" Error: NA/NaN argument
# #This is the most robust which can be improved by conditional check for positive integers
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}
# #131071 (12,251st), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]
getPrimeUpto_a <- function(x){
# #Extremely slow, cannot go beyond 8191L in benchmark testing
if(x < 2) return("ERROR")
y <- 2:x
primes <- rep(2L, x)
j <- 1L
for (i in y) {
if (!any(i %% primes == 0)) {
j <- j + 1L
primes[j] <- i
#cat(paste0("i=", i, ", j=", j, ", Primes= ", paste0(head(primes, j), collapse = ", ")))
}
#cat("\n")
}
result <- head(primes, j)
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
# #This is much faster even than "aa[f_isPrimeV(aa)]"
if(x < 2) return("ERROR")
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
result <- y
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
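A classic Sieve of Eratosthenes is another alternative to the repeated-filtering loop of getPrimeUpto_b; the function below is an illustrative sketch (name and code are not from the text):

```r
# Sieve of Eratosthenes: mark composites in a logical vector, return primes
getPrimeUpto_sieve <- function(x) {
  if (x < 2) return(integer(0))
  is_prime <- rep(TRUE, x)
  is_prime[1] <- FALSE
  if (x >= 4) {
    for (i in 2:floor(sqrt(x))) {
      # Strike out every multiple of i, starting at i^2 (smaller multiples
      # were already struck out by smaller factors)
      if (is_prime[i]) is_prime[seq.int(i * i, x, by = i)] <- FALSE
    }
  }
  which(is_prime)
}
# There are 25 primes below 100, matching xxPrime100 above
stopifnot(length(getPrimeUpto_sieve(100L)) == 25L)
stopifnot(identical(getPrimeUpto_sieve(10L), c(2L, 3L, 5L, 7L)))
```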
getPrimeUpto_c <- function(x) {
# #Problems and Slow
# #Returns a Vector of Primes till the Number i.e. getPrimeUpto_c(7) = (2, 3, 5, 7)
# #NOTE: getPrimeUpto_c(1) and getPrimeUpto_c(2) both return "2"
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(x))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #Cannot create vector of length 2147483647L and also not needed that many
# #ceiling(sqrt(7L)) returns 3; however we need length 4 (2, 3, 5, 7)
# #So, added extra 10
#primes <- rep(NA_integer_, 10L + sqrt(2L))
primes <- rep(2L, 10L + sqrt(2L))
j <- 1L
primes[j] <- 2L
#
i <- 2L
while(i <= x) {
# #na.omit() was the slowest step, so changed all NA to 2L in the primes
#k <- na.omit(primes[primes <= ceiling(sqrt(i))])
k <- primes[primes <= ceiling(sqrt(i))]
if(all(as.logical(i %% k))) {
j <- j + 1
primes[j] <- i
}
# #Increment with INTEGER Addition
i = i + 1L
}
result <- primes[complete.cases(primes)]
str(result)
cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
# #Using Recursion is NOT a good solution
# #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
if(i > 10000){
cat("Reached 10000 \n")
return(primes)
}
if(bypass) {
maxN <- 1000L
if(!is.integer(n)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(n))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(n <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(n > maxN) {
cat(paste0("Error! This will calculate only upto ", maxN, " prime Numbers. \n"))
stop()
}
}
if(length(primes) < n) {
if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
# #Coercing 0 to FALSE, Non-zero Values to TRUE
# # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
# # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
# #However, the above needs hardcoded 2L as prime so the vector is never empty
# #Current Number is Prime, so include it in the vector and check the successive one
getPrimeUpto_d(n, i = i+1, primes = c(primes, i), bypass = FALSE)
} else {
# #Current Number is NOT Prime, so check the successive one
getPrimeUpto_d(n, i = i+1, primes = primes, bypass = FALSE)
}
} else {
# #Return the vector when it reaches the count
return(primes)
}
}
\[\overline{x} = \frac{1}{n}\left (\sum_{i = 1}^n{{x}_i}\right ) = \frac{{x}_1 + {x}_2 + \cdots + {x}_n}{n} \tag{23.6}\]
In the mean calculation, normally each \({{x}_i}\) is given equal importance or weightage of \({1/n}\). However, in some instances the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, as given in equation (23.7)
\[\overline{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{23.7}\]
Caution: The unit of the mean is the same as the unit of the variable; e.g. for cost_per_kg the weight ‘w’ would be in ‘kg’.
aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5
aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71
aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25),
kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))
\[\begin{align} \text{if n is odd, } median(x) & = x_{(n + 1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)} + x_{(n/2) + 1}}{2} \end{align} \tag{23.8}\]
aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5
\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{23.9}\]
aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729
aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889
# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9
# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
## VAL n
## <int> <int>
## 1 0 8
## 2 1 8
## 3 2 12
## 4 3 12
## 5 4 10
## 6 5 8
## 7 6 9
## 8 7 8
## 9 8 12
## 10 9 13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9
f_getMode <- function(x) {
# #Calculate Statistical Mode
# #NOTE: Single Length, All NA, Characters etc. have NOT been validated
# #https://stackoverflow.com/questions/56552709
# #https://stackoverflow.com/questions/2547402
# #Remove NA
if (anyNA(x)) {
x <- x[!is.na(x)]
}
# #Get Unique Values
ux <- unique(x)
# #Match
ux[which.max(tabulate(match(x, ux)))]
}
The type = 6 option of quantile() uses equation (23.10); the default is type = 7.
\[L_p = \frac{p}{100}(n + 1) \tag{23.10}\]
# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50%
## 4.5
# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50%
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
#
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 4.50 4.71 7.25 9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.75 9.00
In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.
range()
max() - min()
IQR()
\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{23.11}\]
\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \overline{x}\right)^2} \end{align} \tag{23.12}\]
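The sample (n - 1) and population (n) versions in equations (23.11) and (23.12) differ only by a constant factor, which is easy to verify on toy data (first 10 integers, assumed for illustration):

```r
# Sample vs population variance and SD for 1:10
aa <- 1:10
n  <- length(aa)
s2   <- var(aa)                         # sample variance, divides by n - 1
sig2 <- sum((aa - mean(aa))^2) / n      # population variance, divides by n
stopifnot(all.equal(s2, sig2 * n / (n - 1)))
stopifnot(all.equal(sd(aa), sqrt(s2)))  # sd() is the square root of var()
```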
Refer to figure 23.1.
Skewness is given by the equation (23.13), which is being shown here because it ~~looked cool~~ has deep meaning.
\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline{x}})^{2} \right]^{3/2}} \tag{23.13}\]
Figure 23.1 Beta (Negative, Left Tail), Normal Distribution, Exponential (Positive, Right Tail)
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ",
paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ",
paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has value close to 0
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, Normal has (small) NON-Zero Positive Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <- vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: (sigma_ (x_i - mu)^3) /(n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4),
" (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)
set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positive skew
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negative skew
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)
# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: cap_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
axis.ticks = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) +
labs(x = "x", y = "Density",
subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)
Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.
Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.
Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.
Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”
The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.
# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has value close to 3 Kurtosis (=0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <- vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: (sigma_ (x_i - mu)^4) /(n * sd^4)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4),
" (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)
Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.
\[z_i = \frac{{x}_i - {\overline{x}}}{{s}} \tag{23.14}\]
NOTE: The “Z statistic” is a special case of the “Z critical” (i.e. standard score) because \(\sigma/\sqrt{n}\) is the ‘standard error of the sample mean’, which means that it is a standard deviation: rather than (e.g.) a known population standard deviation or even just a sample standard deviation, per the CLT it is the standard deviation of the sample mean. The ‘critical Z’ is something that can always be computed (“the general case”) whenever there is a mean and a standard deviation; it translates X into a Z variable with zero mean and unit variance (it “imposes normality” even when the data may not be normal!). The “Z statistic” standardizes similarly, as the special case where the quantity being standardized is the sample mean.
Caution:
xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
## num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
## num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "scaled:center")= num 151
## - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
ttl_hh <- "Flights: Air Time (Scaled)"
cap_hh <- "C03P08"
Figure 23.2 Before and After Scaling
# #hh$ee ttl_hh cap_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
coord_cartesian(ylim = c(0, 35000)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3),
"; SD= ", round(sd(.[[1]]), 3),
")"),
caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C03)
rm(C03)
if(FALSE){
# #check_overlap = TRUE works for de-blurring. However, it still checks each point and is thus slow
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE)
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character")),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE )
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
}
# #List All Colour Names in R
str(colors())
## chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <- viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
#
# #Diverging Colour Palette from 'RColorBrewer'
# #Hex Values
brewer.pal(3, "BrBG")
## [1] "#D8B365" "#F5F5F5" "#5AB4AC"
if(FALSE) display.brewer.pal(3, "BrBG")
The Five-Number Summary is used to quickly summarise a dataset: Min, Q1, Median, Q3, Max.
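Base R returns the five-number summary directly via fivenum(); the toy vector below is an assumed illustration. Note that fivenum() uses Tukey's hinges, which can differ slightly from quantile() quartiles:

```r
# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
aa <- c(1, 3, 5, 7, 9, 11)
fivenum(aa)
stopifnot(all.equal(fivenum(aa), c(1, 3, 6, 9, 11)))
```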
Figure 23.3 geom_boxplot()
# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
ggplot(data = ., mapping = aes(x = month, y = temp)) +
#geom_violin() +
geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
#geom_point() +
#geom_jitter(position=position_jitter(0.2)) +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch",
caption = "C03P01", title = "BoxPlot")
}
\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{23.15}\]
Figure 23.4 Scatter Plot Quadrants for Covariance
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
bb <- f_getRDS(xxCommercials)
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"),
x = round(mean(Commercials), 3), y = -Inf),
color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"),
x = Inf, y = round(mean(Sales), 3)),
color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_point() +
k_gglayer_scatter +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"),
caption = "C03P02", title = "Scatter Plot")
}
\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{23.16}\]
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sx, Sample Standard Deviations Sx Sy
sxy <- sum(ii$devXY) / {nrow(ii) -1}
sx <- round(sqrt(sum(jj$devXsq) / {nrow(jj) -1}), 2)
sy <- round(sqrt(sum(jj$devYsq) / {nrow(jj) -1}), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / {sx * sy}, 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
| Week | Commercials | Sales | devX | devY | devXY | devXsq | devYsq |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 50 | -1 | -1 | 1 | 1 | 1 |
| 2 | 5 | 57 | 2 | 6 | 12 | 4 | 36 |
| 3 | 1 | 41 | -2 | -10 | 20 | 4 | 100 |
| 4 | 3 | 54 | 0 | 3 | 0 | 0 | 9 |
| 5 | 4 | 54 | 1 | 3 | 3 | 1 | 9 |
| 6 | 1 | 38 | -2 | -13 | 26 | 4 | 169 |
| 7 | 5 | 63 | 2 | 12 | 24 | 4 | 144 |
| 8 | 3 | 48 | 0 | -3 | 0 | 0 | 9 |
| 9 | 4 | 59 | 1 | 8 | 8 | 1 | 64 |
| 10 | 2 | 46 | -1 | -5 | 5 | 1 | 25 |
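As a cross-check of the manual table above, base R's `cov()` and `cor()` reproduce the same values directly (a sketch that re-enters the ten data points from the table):

```r
# #Cross-check of the manual calculation with base R
commercials <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)
sales <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)
sxy <- cov(commercials, sales)   #Sample Covariance (n - 1 denominator)
rxy <- cor(commercials, sales)   #Pearson Correlation Coefficient
cat(paste0("Sxy = ", sxy, ", Rxy = ", round(rxy, 2), "\n"))
## Sxy = 11, Rxy = 0.93
```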
\[\begin{align} n! &= \prod _{i=1}^n i \\ &= n \cdot(n-1)\cdot(n-2)\cdot(n-3) \cdots 3 \cdot 2 \cdot 1 \end{align} \tag{24.1}\]
\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{24.2}\]
\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{24.3}\]
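Equations (24.1)–(24.3) map directly onto base R's `factorial()` and `choose()`; a sketch with assumed values N = 5, k = 2:

```r
# #Counting Rules in base R
N <- 5; k <- 2
factorial(N)                 #N!
## [1] 120
choose(N, k)                 #Combinations C(N, k)
## [1] 10
factorial(k) * choose(N, k)  #Permutations P(N, k) = N!/(N-k)!
## [1] 20
```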
\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{24.4}\]
\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{24.5}\]
Often, we begin an analysis with initial or prior probability estimates for the events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about those events. Given this new information, we update the prior probability values to obtain revised probabilities, referred to as posterior probabilities. Bayes' theorem provides the means for making these calculations.
\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{24.6}\]
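A minimal sketch of equation (24.6) in R; the prior and conditional probabilities below are assumed values, not from the text:

```r
# #Bayes' Theorem: Posterior from Prior x Likelihood (assumed values)
prior <- c(A1 = 0.65, A2 = 0.35)        #P(Ai)
likelihood <- c(A1 = 0.02, A2 = 0.05)   #P(B | Ai)
posterior <- prior * likelihood / sum(prior * likelihood)  #P(Ai | B)
print(round(posterior, 4))
##     A1     A2
## 0.4262 0.5738
```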
21.13 Quantitative data that measure ‘how many’ are discrete.
21.14 Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
| \({x}\) | \(f(x)\) | \(\sum xf(x)\) | \((x - \mu)\) | \((x - \mu)^2\) | \(\sum {(x - \mu)^{2}f(x)}\) |
|---|---|---|---|---|---|
| 0 | 0.18 | 0 | -1.5 | 2.25 | 0.405 |
| 1 | 0.39 | 0.39 | -0.5 | 0.25 | 0.0975 |
| 2 | 0.24 | 0.48 | 0.5 | 0.25 | 0.06 |
| 3 | 0.14 | 0.42 | 1.5 | 2.25 | 0.315 |
| 4 | 0.04 | 0.16 | 2.5 | 6.25 | 0.25 |
| 5 | 0.01 | 0.05 | 3.5 | 12.25 | 0.1225 |
| Total | 1.00 | mu = 1.5 | NA | NA | sigma^2 = 1.25 |
# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>%
mutate(across(4:5, ~ replace(., x == "Total", NA)))
# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))
bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Add Total Row
ii <- ii %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum)))
#
# #Modify Specific Row Values without using filter()
# #filter() does not have 'un-filter()' function like group()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx))
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))
bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300
bb <- xxdicarlo_gs
# #Round off values to 1 significant digit i.e. 0.003 or 0.02
# #NOTE: This changes the column to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>%
mutate(across(where(is.numeric), format, digits =1))
## # A tibble: 4 x 7
## Geneva_Saratoga y0 y1 y2 y3 y4 y5
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0 0.07 0.10 0.08 0.03 0.007 0.000
## 2 x1 0.07 0.12 0.11 0.06 0.007 0.003
## 3 x2 0.03 0.14 0.03 0.04 0.010 0.007
## 4 x3 0.01 0.03 0.02 0.01 0.017 0.000
| \(ID\) | \({s}\) | \(f(s)\) | \(\sum sf(s)\) | \((s - E(s))\) | \((s - E(s))^2\) | \(\sum {(s - E(s))^{2}f(s)}\) |
|---|---|---|---|---|---|---|
| A | 0 | 0.070 | 0.00 | -2.64 | 6.99 | 0.489 |
| B | 1 | 0.170 | 0.17 | -1.64 | 2.70 | 0.459 |
| C | 2 | 0.230 | 0.46 | -0.64 | 0.41 | 0.095 |
| D | 3 | 0.290 | 0.87 | 0.36 | 0.13 | 0.037 |
| E | 4 | 0.127 | 0.51 | 1.36 | 1.84 | 0.233 |
| F | 5 | 0.067 | 0.33 | 2.36 | 5.55 | 0.370 |
| G | 6 | 0.023 | 0.14 | 3.36 | 11.27 | 0.263 |
| H | 7 | 0.023 | 0.16 | 4.36 | 18.98 | 0.443 |
| I | 8 | 0.000 | 0.00 | 5.36 | 28.69 | 0.000 |
| Total | NA | 1.000 | E(s) = 2.64 | NA | NA | Var(s) = 2.389 |
bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col()) %>%
mutate(sFs = s * Fs, s_Es = s - sum(sFs),
s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs)
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(across(where(is.numeric), format, digits =2)) %>%
mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>%
mutate(across(c(2, 5, 6), ~ replace(., ID == "Total", NA)))
bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>%
mutate(Fx = rowSums(across(where(is.numeric)))) %>%
select(1, 8) %>%
separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>%
mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
## x Fx xFx x_mu x_mu_sq x_mu_sq_Fx
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.287 0 -1.14 1.31 0.375
## 2 1 0.37 0.37 -0.143 0.0205 0.00760
## 3 2 0.257 0.513 0.857 0.734 0.188
## 4 3 0.0867 0.26 1.86 3.45 0.299
bb <- xxdicarlo_gs
#
# #Tibble Total SUM
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #(1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 9 Unique Combinations = 24 (4x6) Experimental Outcomes
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A" "B" "C" "D" "E" "F"
## [2,] "B" "C" "D" "E" "F" "G"
## [3,] "C" "D" "E" "F" "G" "H"
## [4,] "D" "E" "F" "G" "H" "I"
#
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
## A B C D E F G H I
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, Numerical Index can also be used but Letters are more clear for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
## ID s Fs
## <chr> <dbl> <dbl>
## 1 A 0 0.07
## 2 B 1 0.17
## 3 C 2 0.23
## 4 D 3 0.29
## 5 E 4 0.127
## 6 F 5 0.0667
## 7 G 6 0.0233
## 8 H 7 0.0233
## 9 I 8 0
bb <- xxdicarlo_gs
# #Separate String based on Position
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1)
## # A tibble: 4 x 8
## A B y0 y1 y2 y3 y4 y5
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x 0 21 30 24 9 2 0
## 2 x 1 21 36 33 18 2 1
## 3 x 2 9 42 9 12 3 2
## 4 x 3 3 9 6 3 5 0
\[\sigma_{xy} = \frac{\text{Var}(x + y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{25.1}\]
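Equation (25.1) can be verified numerically with `var()` and `cov()`; a sketch using arbitrary sample data:

```r
# #Verify Cov(x, y) = (Var(x + y) - Var(x) - Var(y)) / 2
x <- c(2, 5, 1, 3, 4)
y <- c(50, 57, 41, 54, 54)
lhs <- cov(x, y)
rhs <- (var(x + y) - var(x) - var(y)) / 2
stopifnot(abs(lhs - rhs) < 1e-9)
```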
dbinom(), pbinom(), qbinom(), rbinom()
dpois(), ppois(), qpois(), rpois()
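The d/p/q/r prefixes follow one convention across all distributions: density/mass, cumulative probability, quantile (inverse of p), and random draws. A sketch for Binomial(size = 10, prob = 0.3):

```r
# #d = P(X = x), p = P(X <= x), q = inverse of p, r = random sample
dbinom(x = 3, size = 10, prob = 0.3)   #P(X = 3)
pbinom(q = 3, size = 10, prob = 0.3)   #P(X <= 3)
qbinom(p = pbinom(3, 10, 0.3), size = 10, prob = 0.3)  #Recovers 3
set.seed(1)
rbinom(n = 5, size = 10, prob = 0.3)   #Five random draws
```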
25.2 A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes factor-type codings, e.g. Male as 0, Female as 1.
25.3 A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.
\[\begin{align} E(x) &= \frac{a + b}{2} \\ \text{Var}(x) &= \frac{(b - a)^2}{12} \end{align} \tag{26.1}\]
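A sketch checking the closed forms in (26.1) against a simulation; `a`, `b`, and the sample size are assumed values:

```r
# #Uniform(a, b): E(x) and Var(x) vs simulated values
a <- 0; b <- 10
cat(paste0("E(x) = ", (a + b) / 2, ", Var(x) = ", round((b - a)^2 / 12, 4), "\n"))
## E(x) = 5, Var(x) = 8.3333
set.seed(123)
x <- runif(100000, min = a, max = b)
cat(paste0("Simulated: mean = ", round(mean(x), 2), ", var = ", round(var(x), 2), "\n"))
```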
\[f(x) = {\frac {1}{{\sigma}{\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-{\mu} }{\sigma}}\right) ^{2}} \tag{26.2}\]
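`dnorm()` implements equation (26.2) directly; a sketch comparing it with the density written out by hand:

```r
# #Normal PDF written out vs dnorm()
f_normal <- function(x, mu, sigma) {
  exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
}
stopifnot(abs(f_normal(1.5, mu = 0, sigma = 1) - dnorm(1.5)) < 1e-12)
stopifnot(abs(f_normal(40000, 36500, 5000) - dnorm(40000, 36500, 5000)) < 1e-12)
```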
Figure 26.1 Normal Distribution
# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF")
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for Line/Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5)
#
# #Line upto the Density Curve at Mean
ii <- ii + geom_segment(data = ii_mean,
aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean'
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median,
aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf),
color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE)
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows
ii <- ii +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtitle (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
#plot.caption.position = "plot",
#plot.caption = element_text(hjust = 0),
plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh),
caption = "C06AA", tag = NULL,
title = "Normal Distribution (Symmetrical)")
#
#ii
# #Syntax
#latex2exp::TeX(r'($\sigma =10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)
# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex Equations like the Normal Distribution can crash R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes
attributes(ggplot_build(ii))$names
## [1] "data" "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame': 512 obs. of 18 variables:
## $ y : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ x : num -3.63 -3.61 -3.6 -3.58 -3.57 ...
## $ density : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ scaled : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ ndensity : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ count : num 5.04 5.2 5.32 5.41 5.45 ...
## $ n : int 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
## $ flipped_aes: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PANEL : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ fill : logi NA NA NA NA NA NA ...
## $ weight : num 1 1 1 1 1 1 1 1 1 1 ...
## $ colour : chr "black" "black" "black" "black" ...
## $ alpha : logi NA NA NA NA NA NA ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype : num 1 1 1 1 1 1 1 1 1 1 ...
Warning: “…[[1]] is discouraged. Use .data[[1]] instead.”
STOP! STOP! Just STOP using UNICODE for the R Console on WINDOWS (UTF-8 Issue).
\[f(z) = \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{26.3}\]
23.20 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
\({z} \in \mathbb{R} \iff P_{(z)} \in (0, 1)\)
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #default 'lower.tail = TRUE'
z_ii <- 1.00
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between −.50 and 1.25 #0.5859
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10 i.e. right >10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816
\[z = \frac{x - {\mu}}{{\sigma}} \tag{26.4}\]
Reasons to convert normal distributions into the standard normal distribution:
Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
The area under the curve to the right of a z-score is the p-value, and it’s the likelihood of your observation occurring if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
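The statements above as a sketch: the one-tailed p-value (area to the right) for an assumed z-score of 1.75, in the same style as the chunks below:

```r
# #One-tailed p-value: Area to the Right of z (assumed z = 1.75)
z_ii <- 1.75
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.750) = 0.0401
```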
# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) -
pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772
# #Grear Tire N(mu = 36500, sd =5000)
# #Probability that the tire mileage, x, will exceed 40000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (",
round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires to be eligible
# #for the discount guarantee i.e. left <10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii,
") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2
21.2 Elements are the entities on which data are collected. (Generally ROWS)
21.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
21.20 A population is the set of all elements of interest in a particular study.
21.21 A sample is a subset of the population.
21.22 The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.
Random sample vs. SRS
Suppose, from a Population, we take a sample of size \({n}\) and calculate point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the Population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process for 500 times, we will have a frame of \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).
If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).
Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
\[\begin{align} \text{Finite Population:} \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population:} \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{27.1}\]
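Equation (27.1) in R; `N`, `n`, and `sigma` below are assumed values for illustration:

```r
# #Standard Error of the Mean: Finite vs Infinite Population (assumed values)
N_ii <- 2500; n_ii <- 30; sigma_ii <- 4000
se_inf <- sigma_ii / sqrt(n_ii)
se_fin <- sqrt((N_ii - n_ii) / (N_ii - 1)) * se_inf   #Finite Population Correction
cat(paste0("Infinite: ", format(round(se_inf, 1), nsmall = 1),
           "; Finite: ", format(round(se_fin, 1), nsmall = 1), "\n"))
## Infinite: 730.3; Finite: 726.0
```

When n is small relative to N, the finite population correction is close to 1 and the two formulas nearly agree.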
Refer Effect of Sample Size and Repeat Sampling
Figure 27.1 Effect of Sample Size vs Repeat Sampling
“ForLater”
If a statistically independent sample of \({n}\) observations \({{x}_1, {x}_2, \ldots, {x}_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by
\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{27.2}\]
The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:
\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{27.3}\]
As this is only an ‘estimator’ for the true “standard error,” other notations are used, such as:
\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{27.4}\]
OR:
\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{27.5}\]
Key:
Non-mathematical view:
Form of the Sampling Distribution of \({\overline{x}}\)
When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.
When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).
How large the sample size needs to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal
Task of developing a profile of 2500 managers. The characteristics to be identified include the mean annual salary for the managers and the proportion of managers having completed a training.
Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.
Properties of Point Estimators
Three properties of good point estimators: unbiasedness, efficiency, and consistency.
\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)
Other Sampling Methods
In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.
In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.
Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).
Example: Lloyd Department Store
Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (28.1)
\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{28.1}\]
where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.
For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).
# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.
Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.
As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \, (\text{DOF} = 9)\), \(t_{0.025} = 2.200 \, (\text{DOF} = 60)\), and \(t_{0.025} = 1.96 \, (\text{DOF} = \infty) = z_{0.025}\)
Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (28.2)
\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{28.2}\]
where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.
Refer equation (23.12), the expression for the sample standard deviation is
\[{s} = \sqrt{\frac{\sum \left(x_i - \overline{x}\right)^2}{n-1}}\]
Why \((n-1)\) are the degrees of freedom
Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-tables have an Infinity row, which matches the z-table; for DOF > 100 it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)
#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
"):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1),
", ", (1-2*p_r_ii) * 100, "% confidence interval is [",
round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
## Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]Note:
Note:
All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.
In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.
For hypothesis tests involving a population mean, we let \({\mu}_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.
Alternative is One-Sided, if it states that a parameter is larger or smaller than the null value. Alternative is Two-sided, if it states that the parameter is different from the null value.
Refer Equality in Hypothesis
Refer Type I and Type II Errors (B12)
Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.
Figure 29.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
28.3 The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.
Refer figure 29.1
29.28 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
25.4 The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 29.1
| Test statistic | \({H_0}\) and \({H_a}\) | Statistical tests that use it |
|---|---|---|
| t-value | Null: The means of two groups are equal. Alternative: The means of two groups are not equal. | T-test, Regression tests |
| z-value | Null: The means of two groups are equal. Alternative: The means of two groups are not equal. | Z-test |
| F-value | Null: The variation among two or more groups is greater than or equal to the variation between the groups. Alternative: The variation among two or more groups is smaller than the variation between the groups. | ANOVA, ANCOVA, MANOVA |
| \({\chi}^2\text{-value}\) | Null: Two samples are independent. Alternative: Two samples are not independent (i.e. they are correlated). | Chi-squared test, Non-parametric correlation tests |
23.17 A tail refers to the tapering sides at either end of a distribution curve.
One-tailed tests are concerned with one side of a statistic; they deal with only one tail of the distribution, so the z-score lies on only one side of the statistic. Two-tailed tests deal with both tails of the distribution, and the z-score is on both sides of the statistic.
In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.
In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).
One-Tail vs. Two-Tail
One-tailed tests about a population mean take one of the following two forms:
29.4 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
29.5 \(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.
\[z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}} \tag{29.1}\]
The key question for a lower tail test is: How small must the test statistic \({z}\) be before we choose to reject the null hypothesis?
Two approaches can be used to answer this: the p-value approach and the critical value approach.
p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.
For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. - we use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic. - After computing the p-value, we must then decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.
For the Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)
Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis. - Critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic. - In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.
Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.
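Both approaches can be sketched in R for the Hilltop lower tail test. The figures below (\(\overline{x} = 2.92\), \({\sigma} = 0.18\), \(n = 36\), \({\alpha} = 0.01\)) are assumed from the referenced textbook example; they reproduce the p-value of 0.0038 quoted above:

```r
# #Hilltop Coffee, lower tail test (assumed figures: xbar = 2.92, sigma = 0.18, n = 36)
mu_0 <- 3; x_bar <- 2.92; sigma <- 0.18; n <- 36; alpha <- 0.01
z <- (x_bar - mu_0) / (sigma / sqrt(n))     # #Test statistic, equation (29.1)
z
## [1] -2.666667
# #p-value approach: P(Z <= z); reject H0 if p-value <= alpha
round(pnorm(q = z, lower.tail = TRUE), 4)
## [1] 0.0038
# #Critical value approach: reject H0 if z <= -z_alpha
qnorm(p = alpha, lower.tail = TRUE)
## [1] -2.326348
```

Both routes lead to the same decision here: \(z = -2.67\) is below the critical value \(-2.33\), and the p-value 0.0038 is below \({\alpha} = 0.01\).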
For an upper tail test, the test statistic \({z}\) is still computed as before. However, the p-value is now the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for the upper tail test in the \({\sigma}\) known case, we use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. The critical value approach rejects the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).
28.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
\[Z = \frac {{\overline{x}} - {\mu}}{{\sigma}/{\sqrt{n}}} \quad \iff {\mu} = {\overline{x}} - Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} = {\overline{x}} \pm Z \frac{{\sigma}}{\sqrt{n}} \quad \to {\mu} \approx {\overline{x}} \pm Z \frac{{s}}{\sqrt{n}} \tag{29.2}\]
29.6 \(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
Ex: Golf Company, mean driving distance is 295 yards i.e. \(({\mu}_0 = 295)\)
“ForLater” - This part needs to be moved to the Next Chapter.
Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
(Online, might be wrong) Ex: Assume that for a Population with mean \({\mu}\) unknown and standard deviation \({\sigma} = 15\), if we take a sample \({n = 100}\) its sample mean is \({\overline{x}} = 42\).
Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2 = 0.05/2} = 1.960\)
As shown in the equation (29.2), our interval range is \(\mu = \overline{X} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)
We are 95% confident that the population mean will be between 39.06 and 44.94.
Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
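This interpretation can be illustrated by simulation: drawing many samples from a known population, roughly 95% of the resulting intervals cover the true mean. The population parameters below are arbitrary:

```r
# #Simulation: about 95% of t intervals from repeated samples cover the true mean
set.seed(1)
mu <- 100; sigma <- 15; n <- 30; m <- 10000
covered <- replicate(m, {
  x <- rnorm(n = n, mean = mu, sd = sigma)
  moe <- qt(p = 0.975, df = n - 1) * sd(x) / sqrt(n)
  (mean(x) - moe <= mu) & (mu <= mean(x) + moe)
})
mean(covered)   # #Coverage proportion, close to 0.95
```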
Common Steps
p-Value Approach Step
Critical Value Approach
Refer to equation (28.1). For the \({\sigma}\) known case, the \(100{(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by
\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]
We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.
Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).
The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.
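A minimal sketch of this equivalence on illustrative data, using t.test() (which builds the t-based interval, since \({\sigma}\) is estimated): the interval contains \({\mu}_0\) exactly when the two-tailed p-value is at least \({\alpha}\).

```r
# #Equivalence of the 95% confidence interval and the two-tailed test at alpha = 0.05
set.seed(7)
x <- rnorm(n = 40, mean = 52, sd = 8)   # #Illustrative sample
mu_0 <- 50
tt <- t.test(x = x, mu = mu_0, conf.level = 0.95)
inside <- tt$conf.int[1] <= mu_0 & mu_0 <= tt$conf.int[2]
identical(inside, tt$p.value >= 0.05)
## [1] TRUE
```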
Ex: Golf company
“ForLater” - Exercises
\[t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}} \tag{29.3}\]
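Equation (29.3) can be computed directly from summary data. The figures below are assumed so as to reproduce the \(t = 1.84\) used in the one-tailed example that follows:

```r
# #t statistic from summary data (assumed illustrative figures)
mu_0 <- 7; x_bar <- 7.25; s <- 1.052; n <- 60
t_stat <- (x_bar - mu_0) / (s / sqrt(n))   # #Equation (29.3)
round(t_stat, 2)
## [1] 1.84
# #Upper tail p-value with n - 1 = 59 degrees of freedom
pt(q = t_stat, df = n - 1, lower.tail = FALSE)
```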
One-Tailed Test
Critical Value Approach - \((\text{DOF = 59}), \, t_{{\alpha} = 0.05} = 1.671\) - Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Cumulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093
Two-Tailed Test
Critical Value Approach - \((\text{DOF = 24})\) - We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\) - Compare the test statistic with the critical t-value - Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we cannot reject \({H_0}\)
Using \({p}_0\) to denote the hypothesized value for the population proportion, the three forms for a hypothesis test about a population proportion \({p}\) are :
Hypothesis tests about a population proportion are based on the difference between the sample proportion \({\overline{p}}\) and the hypothesized population proportion \({p}_0\)
The sampling distribution of \({\overline{p}}\), the point estimator of the population parameter \({p}\), is the basis for developing the test statistic.
When the null hypothesis is true as an equality, the expected value of \({\overline{p}}\) equals the hypothesized value \({p}_0\) i.e. \(E_{(\overline{p})} = {p}_0\)
The standard error of \({\overline{p}}\) is given in equation (29.4)
\[{\sigma}_{\overline{p}} = \sqrt{\frac{{p}_0 (1 - {p}_0)}{n}} \tag{29.4}\]
If \(n{p}_0 \geq 5\) and \(n(1 − {p}_0) \geq 5\), the sampling distribution of \({\overline{p}}\) can be approximated by a normal distribution. Under these conditions, which usually apply in practice, the quantity \({z}\) as given in equation (29.5) has a standard normal probability distribution.
Test Statistic for Hypothesis Tests about a Population Proportion :
\[z = \frac{{\overline{p}} - {p}_0}{{\sigma}_{\overline{p}}} = \frac{{\overline{p}} - {p}_0}{\sqrt{\frac{{p}_0 (1 - {p}_0)}{n}}} \tag{29.5}\]
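A sketch of equations (29.4) and (29.5) in R; the values \({p}_0 = 0.50\), \({\overline{p}} = 0.56\), \(n = 200\) are arbitrary illustrations:

```r
# #Test statistic for a population proportion (illustrative values)
p_0 <- 0.50; p_bar <- 0.56; n <- 200
se <- sqrt(p_0 * (1 - p_0) / n)     # #Standard error, equation (29.4)
z <- (p_bar - p_0) / se             # #Test statistic, equation (29.5)
round(z, 3)
## [1] 1.697
# #Upper tail p-value
pnorm(q = z, lower.tail = FALSE)
```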
Example: Pine Creek: Determine whether the proportion of women golfers increased from \(p_0 = 0.20\)
29.26 \(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
{0.25 - 0.20}/{sqrt(0.20 * {1 - 0.20} / 400)} \(\#\mathcal{R}\)
pnorm(q = 2.50, lower.tail = FALSE) \(\#\mathcal{R}\)
If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).
If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.
“ForLater” - Calculate \({\beta}\)
When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.
Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.
We can make 3 observations about the relationship among \({\alpha}, \beta, n (\text{sample size})\).
How interval estimates and hypothesis tests can be developed for situations involving two populations when the difference between the two population means or the two population proportions is of prime importance.
Example
Inferences About the Difference Between Two Population Means
Interval Estimation of \(({\mu}_1 - {\mu}_2)\)
The point estimator of the difference between the two population means \(({\mu}_1 - {\mu}_2)\) is the difference between the two sample means \(({\overline{x}}_1 - {\overline{x}}_2)\). Thus, \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) represents the difference of population means. It is given by equation (30.1)
Point Estimate :
\[E_{( {\overline{x}}_1 - {\overline{x}}_2 )} = {\overline{x}}_1 - {\overline{x}}_2 \tag{30.1}\]
As with other point estimators, the point estimator \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )}\) has a standard error \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\), that describes the variation in the sampling distribution of the estimator. It is the standard deviation of the sampling distribution of \(({\overline{x}}_1 - {\overline{x}}_2)\). Refer equation (30.2)
Standard Error of \(({\overline{x}}_1 - {\overline{x}}_2)\) :
\[{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{30.2}\]
28.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
In the case of estimation of the difference between two population means, an interval estimate will take the following form: \(E_{( {\overline{x}}_1 - {\overline{x}}_2 )} \, \pm \text{MOE}_{{\gamma}}\). Refer equation (30.3) and (30.4)
Margin of Error (\(\text{MOE}_{{\gamma}}\)) :
\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{30.3}\]
\(\text{Interval Estimate}_{\gamma}\) :
\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {z}_{\frac{{\alpha}}{2}} \sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}} \tag{30.4}\]
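Equations (30.2) to (30.4) can be computed directly. The standard deviations and sample sizes below reuse the figures from the inline computations nearby, while the sample means of 82 and 78 are assumed for illustration:

```r
# #Interval estimate for (mu1 - mu2), sigma known (assumed sample means 82 and 78)
x1 <- 82; x2 <- 78; s1 <- 9; n1 <- 36; s2 <- 10; n2 <- 49
se <- sqrt(s1^2/n1 + s2^2/n2)                 # #Standard error, equation (30.2)
moe <- qnorm(p = 0.975) * se                  # #Margin of error, equation (30.3)
round(c(Lower = (x1 - x2) - moe, Upper = (x1 - x2) + moe), 2)   # #Equation (30.4)
```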
Example: Greystone: Difference between the mean
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
sqrt(9^2/36 + 10^2/49) \(\#\mathcal{R}\)
Using \({D_0}\) to denote the hypothesized difference between \({\mu}_1\) and \({\mu}_2\), the three forms for a hypothesis test are as follows:
The test statistic for the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are known is given in equation (30.5)
Test Statistic for Hypothesis Tests :
\[z = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{\sigma}_1^2}{{n}_1} + \frac{{\sigma}_2^2}{{n}_2}}} \tag{30.5}\]
Example: Evaluate differences in education quality between two training centers
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(10^2/30 + 10^2/40) \(\#\mathcal{R}\)
2 * pnorm(q = 1.66, lower.tail = FALSE) \(\#\mathcal{R}\)
# #Get P(z) for z = 1.66 (Two-Tail)
#
# #Get the default (lower), subtract from 1, Double if Two-Tail
ii <- 2 * {1 - pnorm(q = 1.66)}
jj <- 2 * {1 - pnorm(q = 1.66, lower.tail = TRUE)}
#
# #Use the symmetry i.e. 'minus z' value, Double if Two-Tail
kk <- 2 * pnorm(q = -1.66)
#
# #Use the actual Upper Tail Option, Double if Two-Tail
ll <- 2 * pnorm(q = 1.66, lower.tail = FALSE)
#
stopifnot(all(identical(round(ii, 7), round(jj, 7)), identical(round(ii, 7), round(kk, 7)),
identical(round(ii, 7), round(ll, 7))))
ll
## [1] 0.09691445
The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data being tested are not normally distributed.
On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) cannot be rejected (e.g., for an alpha level of .05, a data set with a p value of less than .05 rejects the null hypothesis that the data are from a normally distributed population).
Shapiro–Wilk test is known not to work well in samples with many identical values.
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(2.2^2/50 + 3^2/35) \(\#\mathcal{R}\)
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
sqrt(5.2^2/40 + 6^2/50) \(\#\mathcal{R}\)
pnorm(q = 2.03, lower.tail = FALSE) \(\#\mathcal{R}\)
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(4.55^2/37 + 3.97^2/44) \(\#\mathcal{R}\)
qnorm(p = 0.025, lower.tail = FALSE) \(\#\mathcal{R}\)
pnorm(q = 4.13369, lower.tail = FALSE) \(\#\mathcal{R}\)
30.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
sqrt(6^2/60 + 6^2/60) \(\#\mathcal{R}\)
pnorm(q = -2.738613, lower.tail = TRUE) \(\#\mathcal{R}\)
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
pnorm(q = 0.2739, lower.tail = FALSE) \(\#\mathcal{R}\)
30.2 \(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
pnorm(q = -1.826, lower.tail = TRUE) \(\#\mathcal{R}\)
qnorm(p = 0.05, lower.tail = TRUE) \(\#\mathcal{R}\)
Use the sample standard deviations, \({s}_1\) and \({s}_2\), to estimate the unknown population standard deviations \(({\sigma}_1, {\sigma}_2)\).
When \(({\sigma}_1, {\sigma}_2)\) are estimated by \(({s}_1, {s}_2)\), the t distribution is used to make inferences about the difference between two population means.
Interval Estimation of \(({\mu}_1 - {\mu}_2)\)
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (30.6) like (30.3)
\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{30.6}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer (30.7) like (30.4) using (30.2)
\[\text{Interval Estimate}_{\gamma} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)} = ({\overline{x}}_1 - {\overline{x}}_2) \pm {t}_{\frac{{\alpha}}{2}}\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}} \tag{30.7}\]
Degrees of Freedom (DOF) : Refer (30.8)
\[\text{DOF} = \frac{ { \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right) }^2} {\frac{1}{n_1 - 1}{ \left( \frac{s_1^2}{n_1} \right) }^2 + \frac{1}{n_2 - 1}{ \left( \frac{s_2^2}{n_2} \right) }^2} \tag{30.8}\]
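Equation (30.8) can be wrapped in a small helper. Truncating the result downward (the floor() used in the inline computations nearby) gives a conservative integer DOF; with the s1 = 150, n1 = 28, s2 = 125, n2 = 22 figures from the Clearwater example this yields 47:

```r
# #Welch-Satterthwaite degrees of freedom, equation (30.8)
dof_welch <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
# #Round down for a conservative (smaller) DOF
floor(dof_welch(s1 = 150, n1 = 28, s2 = 125, n2 = 22))
## [1] 47
```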
Example: Clearwater - To estimate the difference between the mean
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(150^2/28 + 125^2/22) \(\#\mathcal{R}\)
floor({150^2 / 28 + 125^2 / 22 }^2 / {{150^2 / 28}^2/{28-1} + {125^2 / 22}^2/{22-1}}) \(\#\mathcal{R}\)
qt(p = 0.025, df = 47, lower.tail = FALSE) \(\#\mathcal{R}\)
Hypothesis Tests
Test Statistic for Hypothesis Tests : Refer (30.9) like (30.5) using (30.2)
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}} = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{\sqrt{\frac{{s}_1^2}{{n}_1} + \frac{{s}_2^2}{{n}_2}}} \tag{30.9}\]
Caution: Use of \({\sigma}_{({\overline{x}}_1 - {\overline{x}}_2)}\) symbol is probably wrong here because it should not represent formula containing \({s}_1, {s}_2\). “ForLater”
Software: To show that the new software will provide a shorter mean time
30.3 \(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
sqrt(40^2/12 + 44^2/12) \(\#\mathcal{R}\)
floor({40^2 / 12 + 44^2 / 12 }^2 / {{40^2 / 12}^2/{12-1} + {44^2 / 12}^2/{12-1}}) \(\#\mathcal{R}\)
pt(q = 2.272, df = 21, lower.tail = FALSE) \(\#\mathcal{R}\)
CONVERT HERE to Tilde Based Option, to take advantage of Column Headers
# #Software
xxSoftware <- tibble(Old = c(300, 280, 344, 385, 372, 360, 288, 321, 376, 290, 301, 283),
New = c(274, 220, 308, 336, 198, 300, 315, 258, 318, 310, 332, 263))
aa <- xxSoftware
# Summary
aa %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value, list(Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 4
## key Count Mean SD
## <chr> <int> <dbl> <dbl>
## 1 New 12 286 44.0
## 2 Old 12 325 40.0
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #Welch Two Sample t-test
# #Alternative must be: "two.sided" (Default), "less", "greater"
# #Note: with the formula interface, groups are ordered alphabetically (New, Old), so
# #alternative = "greater" tests mean(New) - mean(Old) > 0, hence the large p-value below.
bb_ha <- "greater"
#bb_testT <- t.test(x = bb$Old, y = bb$New, alternative = bb_ha)
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha)
bb_testT
##
## Welch Two Sample t-test
##
## data: value by key
## t = -2.2721, df = 21.803, p-value = 0.9833
## alternative hypothesis: true difference in means between group New and group Old is greater than 0
## 95 percent confidence interval:
## -68.48569 Inf
## sample estimates:
## mean in group New mean in group Old
## 286 325
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = -2.272127)
cat(paste0("df is the degrees of freedom (df = ", round(bb_testT$parameter, 1), ")\n"))
## df is the degrees of freedom (df = 21.8)
cat(paste0("p-value is the significance level of the t-test (p-value = ",
round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.98335)
cat(paste0("conf.int is the confidence interval of the mean at 95% (conf.int = [",
paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the mean at 95% (conf.int = [-68.486, Inf])
cat(paste0("sample estimates is the mean value of the samples. Mean: ",
paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates is the mean value of the samples. Mean: 286, 325
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
# #t.test() already returns the p-value for the chosen alternative (including
# #the two-sided case), so it is compared with alpha directly.
if(bb_testT$p.value >= alpha) {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha,
"). We failed to reject H0. We cannot conclude that the populations are different.\n"))
} else {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha,
").\nWe can reject the H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.98335) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.
Another approach used to make inferences about the difference between two population means when \({\sigma}_1\) and \({\sigma}_2\) are unknown is based on the assumption that the two population standard deviations are equal \(({\sigma}_1 = {\sigma}_2 = {\sigma})\). Under this assumption, the two sample standard deviations are combined to provide the pooled sample variance as given in equation (30.10).
\[{s}_p^2 = \frac{({n}_1 - 1){s}_1^2 + ({n}_2 - 1){s}_2^2}{{n}_1 + {n}_2 - 2} \tag{30.10}\]
The t test statistic becomes (30.11) with \(({n}_1 + {n}_2 - 2)\) degrees of freedom.
\[t = \frac{({\overline{x}}_1 - {\overline{x}}_2) - {D}_0}{{s}_p\sqrt{\frac{1}{{n}_1} + \frac{1}{{n}_2}}} \tag{30.11}\]
Then the computation of the p-value and the interpretation of the sample results are the same as before.
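A sketch checking equations (30.10) and (30.11) by hand against t.test(var.equal = TRUE), on small illustrative samples:

```r
# #Pooled variance (30.10) and pooled t statistic (30.11), with D0 = 0
x1 <- c(10, 12, 9, 11, 13, 12)
x2 <- c( 8, 11, 7, 10, 9, 9)
n1 <- length(x1); n2 <- length(x2)
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)   # #(30.10)
t_manual <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1/n1 + 1/n2))      # #(30.11)
# #t.test with var.equal = TRUE reproduces the same statistic
t_pooled <- t.test(x = x1, y = x2, var.equal = TRUE)$statistic
all.equal(unname(t_pooled), t_manual)
## [1] TRUE
```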
(External) Unpaired Two-Samples T-test
# #Pooled Sample Variance
# #Evaluate differences in education quality between two training centers
# #Generate Data for Two Centers
set.seed(3)
setA <- rnorm(n = 30, mean = 82, sd = 10)
setB <- rnorm(n = 40, mean = 78, sd = 10)
bb <- tibble(sets = c(rep("setA", length(setA)), rep("setB", length(setB))), values = c(setA, setB))
# Summary
bb %>% group_by(sets) %>% summarise(Count = n(), Mean = mean(values), SD = sd(values))
## # A tibble: 2 x 4
## sets Count Mean SD
## <chr> <int> <dbl> <dbl>
## 1 setA 30 79.7 8.10
## 2 setB 40 78.8 9.43
#
# #Assumption 1: Are the two samples independent?
# #YES
#
# #Assumption 2: Do the data from each of the 2 groups follow a normal distribution?
# #Shapiro-Wilk normality test
isNormal_A <- with(bb, shapiro.test(values[sets == "setA"]))
isNormal_B <- with(bb, shapiro.test(values[sets == "setB"]))
#isNormal_A
#isNormal_A$statistic
# #p-value > 0.05 is needed for Normality
isNormal_A$p.value
## [1] 0.1032747
isNormal_B$p.value
## [1] 0.15595
#
# #Both have p-values greater than the significance level alpha = 0.05,
# #implying that the distributions of the data are not significantly different from the normal distribution.
# #In other words, we can assume normality.
#
bb %>%
group_by(sets) %>%
summarise(p = shapiro.test(values)$p.value)
## # A tibble: 2 x 2
## sets p
## <chr> <dbl>
## 1 setA 0.103
## 2 setB 0.156
#
# #Assumption 3. Do the two populations have the same variances
# #We will use F-test to test for homogeneity in variances. (p-value > 0.05 is needed)
#
bb_testF <- var.test(values ~ sets, data = bb)
# bb_testF
bb_testF$p.value
## [1] 0.3990791
#
# #The p-value of F-test is greater than the significance level alpha = 0.05.
# #In conclusion, there is no significant difference between the variances of the two sets of data.
# #Therefore, we can use the classic t-test, which assumes equality of the two variances.
#
# #Compute unpaired two-samples t-test
#
# #Question : Is there any significant difference between the mean of two populations
#
bb_testT <- t.test(formula = values ~ sets, data = bb, var.equal = TRUE)
bb_testT
##
## Two Sample t-test
##
## data: values by sets
## t = 0.41947, df = 68, p-value = 0.6762
## alternative hypothesis: true difference in means between group setA and group setB is not equal to 0
## 95 percent confidence interval:
## -3.383982 5.185373
## sample estimates:
## mean in group setA mean in group setB
## 79.66783 78.76714
#
cat(paste0("t is the t-test statistic value (t = ", round(bb_testT$statistic, 6), ")\n"))
## t is the t-test statistic value (t = 0.419474)
cat(paste0("df (n1 + n2 - 2) is the degrees of freedom (df = ", bb_testT$parameter, ")\n"))
## df (n1 + n2 - 2) is the degrees of freedom (df = 68)
cat(paste0("p-value is the significance level of the t-test (p-value = ",
round(bb_testT$p.value, 6), ")\n"))
## p-value is the significance level of the t-test (p-value = 0.676192)
cat(paste0("conf.int is the confidence interval of the difference in means at 95% (conf.int = [",
paste0(round(bb_testT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval of the difference in means at 95% (conf.int = [-3.384, 5.185])
cat(paste0("sample estimates are the mean values of the samples. Mean: ",
paste0(round(bb_testT$estimate, 2), collapse = ", "), "\n"))
## sample estimates are the mean values of the samples. Mean: 79.67, 78.77
#
# #Compare p-value with alpha = 0.05
alpha <- 0.05
if(bb_testT$p.value >= alpha) {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is greater than alpha (", alpha,
"). We failed to reject H0. We cannot conclude that the populations are different.\n"))
} else {
cat(paste0("p-value (", round(bb_testT$p.value, 6), ") is less than alpha (", alpha,
").\nWe can reject H0 with 95% confidence. The populations are different.\n"))
}
## p-value (0.676192) is greater than alpha (0.05). We failed to reject H0. We cannot conclude that the populations are different.
bb <- xxSoftware
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #Applying Multiple Functions with Summarise but Output as Cross-Table
# #(Original) Columns as Rows, Functions as Columns
# #NOTE: n() can be applied as a lambda function (~n()) inside across()
bb %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd), .names = "{.fn}"))
## # A tibble: 2 x 5
## key N Count Mean SD
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
#
# #gather() is deprecated; it is kept here for reference.
# #Longer Tibble is filled with All Values of Col A, then All Values of Col B and so on
ii <- gather(bb)
jj <- bb %>% gather("key", "value")
kk <- bb %>% gather("key", "value", everything())
#
# #pivot_longer()
# #Longer Tibble is filled with First Row of All Columns, then 2nd Row of All Columns and so on
ll <- bb %>% pivot_longer(everything(), names_to = "key", values_to = "value") %>% arrange(desc(key))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll)))
str(bb)
## tibble [12 x 2] (S3: tbl_df/tbl/data.frame)
## $ Old: num [1:12] 300 280 344 385 372 360 288 321 376 290 ...
## $ New: num [1:12] 274 220 308 336 198 300 315 258 318 310 ...
# #Store a Grouped Tibble
ii <- bb %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key)
str(ii)
## grouped_df [24 x 2] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ key : chr [1:24] "Old" "New" "Old" "New" ...
## $ value: num [1:24] 300 274 280 220 344 308 385 336 372 198 ...
## - attr(*, "groups")= tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
## ..$ key : chr [1:2] "New" "Old"
## ..$ .rows: list<int> [1:2]
## .. ..$ : int [1:12] 2 4 6 8 10 12 14 16 18 20 ...
## .. ..$ : int [1:12] 1 3 5 7 9 11 13 15 17 19 ...
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
ii %>% summarise(across(value, list(N = ~n(), Count = length, Mean = mean, SD = sd),
.names = "{.fn}"))
## # A tibble: 2 x 5
## key N Count Mean SD
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
#
# #Equivalent (except Column Headers)
ii %>% summarise(N = n(), Count = across(value, length),
Mean = across(value, mean), SD = across(value, sd))
## # A tibble: 2 x 5
## key N Count$value Mean$value SD$value
## <chr> <int> <int> <dbl> <dbl>
## 1 New 12 12 286 44.0
## 2 Old 12 12 325 40.0
“ForLater”
sqrt(2.5^2/20 + 4.8^2/30)
floor((2.5^2/20 + 4.8^2/30)^2 / ((2.5^2/20)^2/(20-1) + (4.8^2/30)^2/(30-1)))
qt(p = 0.025, df = 45, lower.tail = FALSE)
30.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
sqrt(5.2^2/35 + 8.5^2/40)
floor((5.2^2/35 + 8.5^2/40)^2 / ((5.2^2/35)^2/(35-1) + (8.5^2/40)^2/(40-1)))
pt(q = 2.179, df = 65, lower.tail = FALSE)
Suppose employees at a manufacturing company can use two different methods to perform a production task. To maximize production output, the company wants to identify the method with the smaller population mean completion time. We can use two alternative designs for the sampling procedure.
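The inline expressions above implement the unequal-variances (Welch) standard error and the Satterthwaite degrees of freedom, rounded down with floor(). The same arithmetic can be sketched in Python as a language-neutral cross-check:

```python
import math

def welch_se_df(s1, n1, s2, n2):
    """Standard error and Welch-Satterthwaite df for two independent samples."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    # df is rounded down, matching floor() in the R expressions above
    df = math.floor((v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1)))
    return se, df

# First example: s1 = 2.5, n1 = 20, s2 = 4.8, n2 = 30
se, df = welch_se_df(2.5, 20, 4.8, 30)
print(round(se, 4), df)  # df = 45, as used in qt(..., df = 45) above
```

The second example, `welch_se_df(5.2, 35, 8.5, 40)`, yields df = 65, matching `pt(q = 2.179, df = 65, ...)`.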
In the matched sample design the two production methods are tested under similar conditions (i.e., with the same workers); hence this design often leads to a smaller sampling error than the independent sample design.
The primary reason is that in a matched sample design, variation between workers is eliminated because the same workers are used for both production methods.
The key to the analysis of the matched sample design is to realize that we consider only the column of differences.
The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.
Let \({\mu}_d\) = the mean of the difference in values for the population
Sample Mean is given by (30.12) like (23.6) and Sample Standard Deviation is given by (30.13) like (23.12)
\[\overline{d} = \frac{\sum{{d}_i}}{n} \tag{30.12}\]
\[{s}_d = \sqrt{\frac{\sum ({d}_i - \overline{d})^2}{n-1}} \tag{30.13}\]
Test Statistic with \((n-1)\) degrees of freedom : Refer (30.14) like (29.3)
\[t = \frac{\overline{d} - {\mu}_d}{{s}_d/\sqrt{n}} \tag{30.14}\]
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer (30.15) like (30.3)
\[\text{MOE}_{{\gamma}} = {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{30.15}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer (30.16) like (30.4)
\[\text{Interval Estimate}_{\gamma} = \overline{d} \pm {t}_{\frac{{\alpha}}{2}}\frac{{s}_d}{\sqrt{n}} \tag{30.16}\]
“ForLater” - Exercise
Example: Two Production Methods
(mean(bb$d) - 0) / (sd(bb$d) / sqrt(length(bb$d)))
2 * pt(q = 2.196, df = 5, lower.tail = FALSE)
qt(p = 0.025, df = 5, lower.tail = FALSE)
# #Matched Samples: Same workers providing data for two methods
xxMatchedMethods <- tibble(M1 = c(6, 5, 7, 6.2, 6, 6.4),
M2 = c(5.4, 5.2, 6.5, 5.9, 6, 5.8))
aa <- xxMatchedMethods
# #Get Difference
bb <- aa %>% mutate(d = M1-M2)
str(bb)
## tibble [6 x 3] (S3: tbl_df/tbl/data.frame)
## $ M1: num [1:6] 6 5 7 6.2 6 6.4
## $ M2: num [1:6] 5.4 5.2 6.5 5.9 6 5.8
## $ d : num [1:6] 0.6 -0.2 0.5 0.3 0 ...
paste0(round(bb[3], 1))
## [1] "c(0.6, -0.2, 0.5, 0.3, 0, 0.6)"
#
cat(paste0("- ${n} = ", length(bb$d), ", {\\overline{d}} = ", round(mean(bb$d), 1),
", {{s}_d} = ", round(sd(bb$d), 3), "$\n"))
## - ${n} = 6, {\overline{d}} = 0.3, {{s}_d} = 0.335$
cat(paste0("t = ", round({mean(bb$d) - 0 } / {sd(bb$d) / sqrt(length(bb$d))}, 3), "\n"))
## t = 2.196
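The hand computation of t can be cross-checked outside R; a minimal Python sketch using the six differences d shown above:

```python
import math
from statistics import mean, stdev

# Differences d = M1 - M2 from the matched-samples example above
d = [0.6, -0.2, 0.5, 0.3, 0, 0.6]

n = len(d)
d_bar = mean(d)                            # (30.12)
s_d = stdev(d)                             # (30.13), sample sd with n - 1
t = (d_bar - 0) / (s_d / math.sqrt(n))     # (30.14), hypothesized mu_d = 0
print(round(t, 3))
```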
#
# #Paired t-test
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #t.test() with paired = TRUE performs a Paired t-test (not a Welch test)
# #Alternative must be: "two.sided" (Default), "less", "greater"
bb_ha <- "two.sided"
bb_testT <- t.test(formula = value ~ key, data = bb, alternative = bb_ha, paired = TRUE)
bb_testT
##
## Paired t-test
##
## data: value by key
## t = 2.1958, df = 5, p-value = 0.07952
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05120834 0.65120834
## sample estimates:
## mean of the differences
## 0.3
To make an inference about the difference between the two population proportions \(({p}_1 - {p}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. Let \(\overline{{p}}_1, \overline{{p}}_2\) denote the sample proportions for simple random samples from populations 1 and 2.
\(({x})\) denotes Count of Success
Interval Estimation of \(({p}_1 - {p}_2)\)
The point estimator of the difference between two population proportions is the difference between the sample proportions of two independent simple random samples.
Point Estimate : Refer (30.17) like (30.1)
\[E_{( {\overline{p}}_1 - {\overline{p}}_2 )} = {\overline{p}}_1 - {\overline{p}}_2 \tag{30.17}\]
As with other point estimators, the point estimator \(E_{( {\overline{p}}_1 - {\overline{p}}_2 )}\) has a standard error \({\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}\), that describes the variation in the sampling distribution of the estimator.
The two population proportions, \(({p}_1, {p}_2)\), are unknown. Thus, sample proportions \(({\overline{p}}_1, {\overline{p}}_2)\) are being used to estimate them.
Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : Refer equation (30.18) like (30.2)
\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{\frac{{p}_1 (1-{p}_1)}{{n}_1} + \frac{{p}_2 (1-{p}_2)}{{n}_2}} \\ &= \sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \end{align} \tag{30.18}\]
Margin of Error (\(\text{MOE}_{{\gamma}}\)) : Refer equation (30.19) like (30.3)
\[\text{MOE}_{{\gamma}} = {z}_{\frac{{\alpha}}{2}}{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} = {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{30.19}\]
\(\text{Interval Estimate}_{\gamma}\) : Refer equation (30.20) like (30.4)
\[\text{Interval Estimate}_{\gamma} = ({\overline{p}}_1 - {\overline{p}}_2) \pm {z}_{\frac{{\alpha}}{2}}\sqrt{\frac{{\overline{p}}_1 (1-{\overline{p}}_1)}{{n}_1} + \frac{{\overline{p}}_2 (1-{\overline{p}}_2)}{{n}_2}} \tag{30.20}\]
Example: Tax Preparation: (Count of Success \(({x})\) is Number of Returns with Errors)
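For this example (\({\overline{p}}_1 = 0.14\) with \({n}_1 = 250\), \({\overline{p}}_2 = 0.09\) with \({n}_2 = 300\)) and a 90% interval, so \(z_{\alpha/2} \approx 1.645\), equations (30.18)-(30.20) can be sketched in Python as a cross-check:

```python
import math

p1, n1 = 0.14, 250   # sample proportion and size, population 1
p2, n2 = 0.09, 300   # sample proportion and size, population 2
z = 1.645            # z_{alpha/2} for 90%: qnorm(p = 0.05, lower.tail = FALSE)

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # (30.18)
moe = z * se                                               # (30.19)
lo, hi = (p1 - p2) - moe, (p1 - p2) + moe                  # (30.20)
print(round(lo, 3), round(hi, 3))
```

The resulting interval, roughly [0.005, 0.095], matches the 90% interval later reported by prop.test(..., correct = FALSE).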
qnorm(p = 0.05, lower.tail = FALSE)
sqrt(0.14 * (1 - 0.14) / 250 + 0.09 * (1 - 0.09) / 300)
When we assume \({H_0}\) is true as an equality, we have \({p}_1 - {p}_2 = 0\), which is the same as saying that the population proportions are equal, \({p}_1 = {p}_2 = {p}\). The equation (30.18) becomes (30.21).
Standard Error of \(({\overline{p}}_1 - {\overline{p}}_2)\) : When \({p}_1 = {p}_2 = {p}\)
\[\begin{align} {\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)} &= \sqrt{{p} (1-{p})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \\ &= \sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)} \end{align} \tag{30.21}\]
With \({p}\) unknown, we pool, or combine, the sample proportions \(({\overline{p}}_1, {\overline{p}}_2)\) to obtain a single point estimator of \({p}\), given by (30.22)
Pooled Estimator of \({p}\) : When \({p}_1 = {p}_2 = {p}\)
\[{\overline{p}} = \frac{{n}_1 {\overline{p}}_1 + {n}_2 {\overline{p}}_2}{{n}_1 + {n}_2} \tag{30.22}\]
Test Statistic for Hypothesis Tests :
\[z = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{{\sigma}_{({\overline{p}}_1 - {\overline{p}}_2)}} = \frac{({\overline{p}}_1 - {\overline{p}}_2)}{\sqrt{{\overline{p}} (1-{\overline{p}})\left(\frac{1}{{n}_1} + \frac{1}{{n}_2}\right)}} \tag{30.23}\]
Example: Tax Preparation: Continuation
30.11 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)
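Applying (30.22) and (30.23) to the tax-preparation counts (35 returns with errors out of 250 vs. 27 out of 300, the same counts passed to prop.test() below); a Python sketch as a cross-check:

```python
import math

x1, n1 = 35, 250   # returns with errors, office 1
x2, n2 = 27, 300   # returns with errors, office 2
p1, p2 = x1 / n1, x2 / n2

p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)   # (30.22), equals (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                         # (30.23)
print(round(p_pool, 4), round(z, 3))
```

Note that z² ≈ 3.408, which is the X-squared value prop.test() reports below.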
(250 * 0.14 + 300 * 0.09) / (250 + 300)
2 * pnorm(q = 1.845, lower.tail = FALSE)
(External) Two-Proportions Z-Test
bb_ha <- "two.sided"
bb_gamma <- 0.90
bb_propT <- prop.test(x = c(35, 27), n = c(250, 300), alternative = bb_ha,
conf.level = bb_gamma, correct = FALSE)
bb_propT
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(35, 27) out of c(250, 300)
## X-squared = 3.4084, df = 1, p-value = 0.06486
## alternative hypothesis: two.sided
## 90 percent confidence interval:
## 0.004815898 0.095184102
## sample estimates:
## prop 1 prop 2
## 0.14 0.09
names(bb_propT)
## [1] "statistic" "parameter" "p.value" "estimate" "null.value" "conf.int"
## [7] "alternative" "method" "data.name"
#
# #X-squared is the square of the calculated z-value
cat(paste0("X-squared is the chi-square test statistic (X-squared = ", round(bb_propT$statistic, 6), ")\n"))
## X-squared is the chi-square test statistic (X-squared = 3.408415)
#
# #By default, the function prop.test() uses the Yates continuity correction
# #It is important if either the expected successes or failures is < 5.
# #If you do not want the correction, use the additional argument correct = FALSE.
# #i.e. To make the test mathematically equivalent to the uncorrected z-test of a proportion.
cat(paste0("p-value is the significance level of the test (p-value = ",
round(bb_propT$p.value, 6), ")\n"))
## p-value is the significance level of the test (p-value = 0.064865)
cat(paste0("conf.int is the confidence interval for the difference in proportions at ", bb_gamma,
" level (conf.int = [", paste0(round(bb_propT$conf.int, 3), collapse = ", "), "])\n"))
## conf.int is the confidence interval for the difference in proportions at 0.9 level (conf.int = [0.005, 0.095])
cat(paste0("sample estimates are the estimated probabilities of success. p: ",
paste0(round(bb_propT$estimate, 2), collapse = ", "), "\n"))
## sample estimates are the estimated probabilities of success. p: 0.14, 0.09
In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.
The sample variance \({s^2}\), given by equation (31.1), is the point estimator of the population variance \({\sigma}^2\).
\[{s^2} = \frac{\sum {({x}_i - {\overline{x}})}^2}{n-1} \tag{31.1}\]
Note:
Example: A sample of 20 containers \({n = 20}\) has the sample variance \({s^2} = 0.0025\)
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
qchisq(p = 0.975, df = 19, lower.tail = FALSE)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq() | pf() qf()
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{31.2}\]
Generalising the equation (31.2), the equation (31.3) is the interval estimate of a population variance.
\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{1-{\alpha}/2}^2}} \tag{31.3}\]
where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.
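For the 20-container example above (n = 20, s² = 0.0025), equation (31.3) uses the χ² quantiles for df = 19: 32.852 as computed above, and approximately 8.907 for the 0.975 upper-tail point (an assumed standard-table value, since that output was lost above). A Python sketch of the interval arithmetic:

```python
# Interval estimate of a population variance, equation (31.3)
n, s2 = 20, 0.0025
chi2_upper = 32.852   # qchisq(p = 0.025, df = 19, lower.tail = FALSE), shown above
chi2_lower = 8.907    # qchisq(p = 0.975, df = 19, lower.tail = FALSE), assumed table value

lo = (n - 1) * s2 / chi2_upper
hi = (n - 1) * s2 / chi2_lower
print(round(lo, 5), round(hi, 5))
```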
Using \({{\sigma}_0^2}\) to denote the hypothesized value for the population variance, the three forms for a hypothesis test are as follows:
Note: In general, Upper Tail test is the most frequently observed test because low variances are generally desirable. With a statement about the maximum allowable population variance, we can test the null hypothesis that the population variance is less than or equal to the maximum allowable value against the alternative hypothesis that the population variance is greater than the maximum allowable value. With this test structure, corrective action will be taken whenever rejection of the null hypothesis indicates the presence of an excessive population variance.
Test Statistic for Hypothesis Tests About a Population Variance: Refer (31.4), where \({\chi^2}\) has a chi-square distribution with \({n - 1}\) degrees of freedom.
\[{\chi^2} = \frac{(n - 1){s}^2}{{\sigma}_0^2} \tag{31.4}\]
Example: Louis: the company standard specifies an arrival time variance of 4 or less
31.3 \(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
pchisq(q = 28.18, df = 23, lower.tail = FALSE)
Example: bureau of motor vehicles: Evaluate the variance in the new examination test scores with the historical value \({\sigma}_0^2 = 100\)
31.4 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)
2 * pchisq(q = 46.98, df = 29, lower.tail = FALSE)
The two sample variances \({s}_1^2\) and \({s}_2^2\) will be the basis for making inferences about the two population variances \({\sigma}_1^2\) and \({\sigma}_2^2\).
Note:
Test Statistic for Hypothesis Tests About Population Variances with \(({\sigma}_1^2 = {\sigma}_2^2)\) : Refer equation (31.5)
\[F = \frac{{s}_1^2}{{s}_2^2} \tag{31.5}\]
Hypothesis Tests :
Example: Dullus County Schools:
31.8 \(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)
2 * pf(q = 2.4, df1 = 25, df2 = 15, lower.tail = FALSE)
Example: public opinion survey: do women show a greater variation in attitude on political issues than men
31.7 \(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
pf(q = 1.5, df1 = 40, df2 = 30, lower.tail = FALSE)
Hypothesis-testing procedures that expand our capacity for making statistical inferences about populations
where \({p}_j\) is population proportion of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments.
Select a random sample from each of the populations and record the observed frequencies, \(f_{ij}\), in a table with 2 rows and k columns.
Expected Frequencies Under the Assumption \({H_0}\) is true : Refer equation (32.1)
\[e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Total Sample Size}} \tag{32.1}\]
Chi-Square Test Statistic : Refer equation (32.2)
\[{\chi^2} = \sum_{i}{\sum_{j}{\frac{(f_{ij} - e_{ij})^2}{e_{ij}}}} \tag{32.2}\]
Where:
\[\begin{align} f_{ij} &= \text{observed frequency for the cell in row } i \text{ and column } j \\ e_{ij} &= \text{expected frequency for the cell in row } i \text{ and column } j \end{align}\]
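Equations (32.1) and (32.2) can be sketched directly. The 2×3 table of observed frequencies below is hypothetical (the chapter's actual data are not reproduced here); the sketch only illustrates the mechanics:

```python
# Hypothetical 2 x 3 table of observed frequencies f_ij (rows: two responses)
f = [[60, 55, 45],
     [40, 45, 55]]

row_tot = [sum(row) for row in f]
col_tot = [sum(f[i][j] for i in range(len(f))) for j in range(len(f[0]))]
total = sum(row_tot)

# Expected frequencies under H0, equation (32.1)
e = [[row_tot[i] * col_tot[j] / total for j in range(len(f[0]))]
     for i in range(len(f))]

# Chi-square statistic, equation (32.2); df = k - 1 = 2 for the equal-proportions test
chi2 = sum((f[i][j] - e[i][j])**2 / e[i][j]
           for i in range(len(f)) for j in range(len(f[0])))
print(round(chi2, 4))
```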
Note: In a chi-square test involving the equality of \({k}\) population proportions, the above test statistic has a chi-square distribution with \({k - 1}\) degrees of freedom provided the expected frequency is 5 or more for each cell.
A chi-square test for equal population proportions will always be an upper tail test with rejection of \({H_0}\) occurring when the test statistic is in the upper tail of the chi-square distribution.
In studies such as these, we often use the same sample size for each population. We have chosen different sample sizes in this example to show that the chi-square test is not restricted to equal sample sizes for each of the k populations.
“ForLater” - Creating the ChiSq Table
Example: JD Power: Compare customer loyalty for three automobiles by using the proportion of owners likely to repurchase a particular automobile
pchisq(q = 7.89, df = 2, lower.tail = FALSE)
“ForLater” - Marascuilo procedure
An important application of a chi-square test involves using sample data to test for the independence of two categorical variables. The null hypothesis for this test is that the two categorical variables are independent. Thus, the test is referred to as a test of independence.
Example: Beer: Preference vs. gender
All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.
The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.
The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.
The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.
Example: Chemitech: Comparison of three methods of assembly A, B, C in terms of most assemblies per week
xxChemitech <- tibble(A = c(58, 64, 55, 66, 67),
B = c(58, 69, 71, 64, 68),
C = c(48, 57, 59, 47, 49))
aa <- xxChemitech
# #Summary
aa %>%
pivot_longer(everything(), names_to = "key", values_to = "value") %>%
group_by(key) %>%
summarise(across(value,
list(Count = length, Mean = mean, SD = sd, Variance = var),
.names = "{.fn}"))
## # A tibble: 3 x 5
## key Count Mean SD Variance
## <chr> <int> <dbl> <dbl> <dbl>
## 1 A 5 62 5.24 27.5
## 2 B 5 66 5.15 26.5
## 3 C 5 52 5.57 31
33.2 \(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
If \({H_0}\) is rejected, we cannot conclude that all population means are different. Rejecting \({H_0}\) means that at least two population means have different values.
Three assumptions are required to use analysis of variance.
If the sample sizes are equal, analysis of variance is not sensitive to departures from the assumption of normally distributed populations.
Normality:
If the means for the three populations are equal, we would expect the three sample means to be close together. The more the sample means differ, the stronger the evidence we have for the conclusion that the population means differ. In other words, if the variability among the sample means is “small,” it supports \({H_0}\); if the variability among the sample means is “large,” it supports \({H_a}\).
If the null hypothesis is true, we can use the variability among the sample means to develop an estimate of \({\sigma}^2\).
First, note that if the assumptions for analysis of variance are satisfied and the null hypothesis is true, each sample will have come from the same normal distribution with mean \({\mu}\) and variance \({\sigma}^2\).
Recall that the sampling distribution of the sample mean \({\overline{x}}\) for a simple random sample of size \({n}\) from a normal population will be normally distributed with mean \({\mu}\) and variance \({\sigma}_{{\overline{x}}}^2 = \frac{{\sigma}^2}{n}\).
In this case, the mean and variance of the three sample mean values \(\{{\overline{x}}_1 = 62, {\overline{x}}_2 = 66, {\overline{x}}_3 = 52\}\) can be used to estimate the mean and variance of the sampling distribution.
When the sample sizes are equal, as in this example, the best estimate of the mean of the sampling distribution of \({\overline{x}}\) is the mean or average of the sample means.
In this example, an estimate of the mean of the sampling distribution of \({\overline{x}}\) is \((62 + 66 + 52)/3 = 60\). We refer to this estimate as the overall sample mean. Refer equation (33.5)
An estimate of the variance of the sampling distribution of \({\overline{x}}\), \({\sigma}_{{\overline{x}}}^2\), is provided by the variance of the three sample means.
\[{s}_{\overline{x}}^2 = \frac{(62 - 60)^2 + (66 - 60)^2 + (52 - 60)^2}{3 - 1} = 52\] Because \({\sigma}^2 = n {\sigma}_{{\overline{x}}}^2\), the estimate can be given by
\[E_{{\sigma}^2} = n E_{{\sigma}_{{\overline{x}}}^2} = n {s}_{\overline{x}}^2 = 5 \times 52 = 260\]
The \(n {s}_{\overline{x}}^2\) is referred to as the between-treatments estimate of \({\sigma}^2\). It is based on the assumption that the null hypothesis is true. In this case, each sample comes from the same population, and there is only one sampling distribution of \({\overline{x}}\).
In contrast, when the population means are not equal, the between-treatments estimate will overestimate the population variance \({\sigma}^2\).
The variation within each of the samples also has an effect on the conclusion we reach in analysis of variance. When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of \({\sigma}^2\). Hence, we can combine or pool the individual estimates of \({\sigma}^2\) into one overall estimate. The estimate of \({\sigma}^2\) obtained in this way is called the pooled or within-treatments estimate of \({\sigma}^2\).
Because each sample variance provides an estimate of \({\sigma}^2\) based only on the variation within each sample, the within-treatments estimate of \({\sigma}^2\) is not affected by whether the population means are equal. When the sample sizes are equal, the within-treatments estimate of \({\sigma}^2\) can be obtained by computing the average of the individual sample variances \(\{27.5, 26.5, 31\}\).
For this example we obtain:
\[\text{Within-treatments estimate of } {\sigma}^2 = \frac{27.5 + 26.5 + 31}{3} = 28.33\] Remember, that the between-treatments approach provides a good estimate of \({\sigma}^2\) only if the null hypothesis is true; if the null hypothesis is false, the between-treatments approach overestimates \({\sigma}^2\). The within-treatments approach provides a good estimate of \({\sigma}^2\) in either case.
Thus, if the null hypothesis is true, the two estimates will be similar and their ratio will be close to 1.
If the null hypothesis is false, the between-treatments estimate will be larger than the within-treatments estimate, and their ratio will be large.
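The two estimates just described can be computed directly from the Chemitech summaries (sample means 62, 66, 52 and variances 27.5, 26.5, 31, each with n = 5); a Python sketch as a cross-check:

```python
from statistics import variance

n = 5                      # observations per treatment
means = [62, 66, 52]       # sample means for A, B, C
vars_ = [27.5, 26.5, 31]   # sample variances for A, B, C

# Between-treatments estimate: n times the variance of the sample means
between = n * variance(means)
# Within-treatments estimate: average of the sample variances (equal n)
within = sum(vars_) / len(vars_)
print(between, round(within, 2))
```

Their ratio, 260 / 28.33 ≈ 9.18, is exactly the F statistic developed below.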
Analysis of Variance and the Completely Randomized Design
where \({\mu}_j\) is mean of the \(j^{\text{th}}\) population. We assume that a simple random sample of size \({n}_j\) has been selected from each of the \({k}\) populations or treatments. For the resulting sample data, let
\[\begin{align} {x}_{ij} &= \text{value of observation } i \text{ for treatment } j \\ {n}_{j} &= \text{number of observations for treatment } j \\ {\overline{x}}_{j} &= \text{sample mean for treatment } j \\ {s}_{j}^2 &= \text{sample variance for treatment } j \\ {s}_{j} &= \text{sample standard deviation for treatment } j \end{align}\]
The formulas for the sample mean and sample variance for treatment \({j}\) are given in equations (33.1) and (33.2)
\[{\overline{x}}_j = \frac{\sum_{i=1}^{n_j}{x}_{ij}}{{n}_j} \tag{33.1}\]
\[{s}_j^2 = \frac{\sum_{i=1}^{n_j}{\left({x}_{ij} - {\overline{x}}_j\right)^2}}{{n}_j - 1} \tag{33.2}\]
The overall sample mean, denoted \({\overline{\overline{x}}}\), is the sum of all the observations divided by the total number of observations.
\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{\sum_{i=1}^{{n}_j}{{x}_{ij}}}}{{n}_T} \tag{33.3}\]
Where
\[{n}_T = {n}_1 + {n}_2 + \cdots + {n}_k \tag{33.4}\]
If the size of each sample is \({n}\), equation (33.4) becomes \({n}_T = kn\), and equation (33.3) reduces to (33.5)
\[{\bar{\bar{x}}} = \frac{\sum_{j=1}^k{{\overline{x}}_{j}}}{k} \tag{33.5}\]
Thus, whenever the sample sizes are the same, the overall sample mean is just the average of the \({k}\) sample means.
Thus, in the example, from (33.5), \({\bar{\bar{x}}} = \frac{62 + 66 + 52}{3} = 60\)
Between-Treatments Estimate of Population Variance
The between-treatments estimate of \({\sigma}^2\) is called the mean square due to treatments and is denoted \(\text{MSTR}\). Refer equation (33.6)
\[\text{MSTR} = \frac{\text{SSTR}}{k - 1} = \frac{\sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2}{k - 1} \tag{33.6}\]
The numerator in equation (33.6) is called the sum of squares due to treatments and is denoted \(\text{SSTR}\). The denominator, \({k − 1}\), represents the degrees of freedom associated with SSTR. Refer equation (33.7)
\[\text{SSTR} = \sum_{j=1}^{k}{n}_j\left({\overline{x}}_j - {\bar{\bar{x}}} \right)^2 \tag{33.7}\]
If \({H_0}\) is true, MSTR provides an unbiased estimate of \({\sigma}^2\). However, if the means of the \({k}\) populations are not equal, MSTR is not an unbiased estimate of \({\sigma}^2\); in fact, in that case, MSTR should overestimate \({\sigma}^2\).
In the example:
Within-Treatments Estimate of Population Variance
The within-treatments estimate of \({\sigma}^2\) is called the mean square due to error and is denoted \(\text{MSE}\). Refer equation (33.8)
\[\text{MSE} = \frac{\text{SSE}}{{n}_T - k} = \frac{\sum_{j=1}^{k}{({n}_j - 1){s}_j^2}}{{n}_T - k} \tag{33.8}\]
The numerator in equation (33.8) is called the sum of squares due to error and is denoted \(\text{SSE}\). The denominator, \({{n}_T - k}\) is referred to as the degrees of freedom associated with SSE. Refer equation (33.9)
\[\text{SSE} = \sum_{j=1}^{k}{({n}_j - 1){s}_j^2} \tag{33.9}\]
In the example:
Comparing the Variance Estimates
If the null hypothesis is true, \(\text{MSTR}\) and \(\text{MSE}\) provide two independent, unbiased estimates of \({\sigma}^2\).
Refer Variance.
We know that for normal populations, the sampling distribution of the ratio of two independent estimates of \({\sigma}^2\) follows an F distribution. Hence, if the null hypothesis is true and the ANOVA assumptions are valid, the sampling distribution of \(\frac{\text{MSTR}}{\text{MSE}}\) is an F distribution with numerator degrees of freedom equal to \({k - 1}\) and denominator degrees of freedom equal to \({{n}_T - k}\).
In other words, if the null hypothesis is true, the value of MSTR/MSE should appear to have been selected from this F distribution. However, if the null hypothesis is false, the value of \(\frac{\text{MSTR}}{\text{MSE}}\) will be inflated because MSTR overestimates \({\sigma}^2\). Hence, we will reject \({H_0}\) if the resulting value of \(\frac{\text{MSTR}}{\text{MSE}}\) appears to be too large to have been selected from an F distribution with \({k - 1}\) numerator degrees of freedom and \({{n}_T - k}\) denominator degrees of freedom.
Because the decision to reject \({H_0}\) is based on the value of \(\frac{\text{MSTR}}{\text{MSE}}\), the test statistic used to test for the equality of \({k}\) population means is given by equation (33.10)
Test Statistic for the Equality of \({k}\) Population Means :
\[F = \frac{\text{MSTR}}{\text{MSE}} \tag{33.10}\]
Because we will only reject the null hypothesis for large values of the test statistic, the p-value is the upper tail area of the F distribution to the right of the test statistic \({F}\).
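Equations (33.5)-(33.10) applied to the Chemitech data reproduce the ANOVA table shown by summary(aov(...)) further below. A Python sketch as a language-neutral cross-check:

```python
from statistics import mean, variance

# Chemitech data, one list per treatment (assembly method)
groups = {"A": [58, 64, 55, 66, 67],
          "B": [58, 69, 71, 64, 68],
          "C": [48, 57, 59, 47, 49]}

k = len(groups)
n_T = sum(len(g) for g in groups.values())
grand = sum(sum(g) for g in groups.values()) / n_T                   # (33.3)

sstr = sum(len(g) * (mean(g) - grand)**2 for g in groups.values())   # (33.7)
sse = sum((len(g) - 1) * variance(g) for g in groups.values())       # (33.9)

mstr = sstr / (k - 1)        # (33.6)
mse = sse / (n_T - k)        # (33.8)
F = mstr / mse               # (33.10)
print(sstr, sse, round(F, 3))
```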
pf(q = 9.18, df1 = 2, df2 = 12, lower.tail = FALSE)
The sum of squares associated with the source of variation referred to as “Total” is called the total sum of squares (SST). SST divided by its degrees of freedom \({n}_T - 1\) is nothing more than the overall sample variance that would be obtained if we treated the entire set of 15 observations as one data set. With the entire data set treated as one sample, SST is given by equation (33.11).
\[\text{SST} = \text{SSTR} + \text{SSE} = \sum_{j=1}^k{\sum_{i=1}^{{n}_j}{\left( {x}_{ij} - \bar{\bar{x}}\right)^2}} \tag{33.11}\]
The degrees of freedom associated with this total sum of squares is the sum of the degrees of freedom associated with the sum of squares due to treatments and the sum of squares due to error i.e. \({n}_T - 1 = (k - 1) + ({n}_T - k)\).
In other words, SST can be partitioned into two sums of squares: the sum of squares due to treatments and the sum of squares due to error. Note also that the degrees of freedom corresponding to SST, \({n}_T - 1\), can be partitioned into the degrees of freedom corresponding to SSTR, \(k - 1\), and the degrees of freedom corresponding to SSE, \({n}_T - k\).
The analysis of variance can be viewed as the process of partitioning the total sum of squares and the degrees of freedom into their corresponding sources: treatments and error. Dividing the sum of squares by the appropriate degrees of freedom provides the variance estimates, the F value, and the p-value used to test the hypothesis of equal population means.
The square root of MSE provides the best estimate of the population standard deviation \({\sigma}\). This estimate of \({\sigma}\) on the computer output is Pooled StDev.
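The arithmetic behind this partition can be traced by hand. A minimal sketch, using the sums of squares reported by the `summary(ii_aov)` output shown below (SSTR = 520, SSE = 340, \(k = 3\), \({n}_T = 15\)):

```r
# #Worked ANOVA arithmetic (values taken from the summary(ii_aov) output below)
sstr <- 520; sse <- 340; k <- 3; n_t <- 15
mstr <- sstr / (k - 1)   # = 260
mse  <- sse / (n_t - k)  # = 28.33
f    <- mstr / mse       # = 9.176
p    <- pf(f, df1 = k - 1, df2 = n_t - k, lower.tail = FALSE)  # = 0.00382
sqrt(mse)                # Pooled StDev, the estimate of sigma
```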
str(aa)
## tibble [5 x 3] (S3: tbl_df/tbl/data.frame)
## $ A: num [1:5] 58 64 55 66 67
## $ B: num [1:5] 58 69 71 64 68
## $ C: num [1:5] 48 57 59 47 49
bb <- aa %>% pivot_longer(everything(), names_to = "key", values_to = "value")
#
# #ANOVA
ii_aov <- aov(formula = value ~ key, data = bb)
#names(ii_aov)
#ii_aov
#
# #
model.tables(ii_aov, type = "means")
## Tables of means
## Grand mean
##
## 60
##
## key
## key
## A B C
## 62 66 52
#
# #Summary
summary(ii_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## key 2 520 260.00 9.176 0.00382 **
## Residuals 12 340 28.33
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.
The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.
The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).
By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.
In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.
Further, the Fisher LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.
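As a sketch of such pairwise comparisons on the three samples from `str(aa)` above, base R's `pairwise.t.test()` with a Bonferroni adjustment (pooled SD, matching the ANOVA's MSE) can be used:

```r
# #Pairwise comparisons with Bonferroni adjustment (the A/B/C data from str(aa))
value <- c(58, 64, 55, 66, 67,   # A
           58, 69, 71, 64, 68,   # B
           48, 57, 59, 47, 49)   # C
key <- rep(c("A", "B", "C"), each = 5)
pairwise.t.test(value, key, p.adjust.method = "bonferroni")
```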
\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{34.1}\]
Note
Multiple regression analysis enables us to understand how a dependent variable is related to two or more independent variables.
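As a hedged illustration of equation (34.1), the simple linear model can be fitted with `lm()`; the data and coefficients here are simulated (hypothetical \({\beta}_0 = 2\), \({\beta}_1 = 3\)), not taken from the chapter:

```r
# #Simulated illustration of y = beta0 + beta1*x + epsilon (hypothetical coefficients)
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)
coef(lm(y ~ x))  # least-squares estimates, close to 2 and 3
```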
Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data. When their assumptions are met, parametric methods are generally more powerful than nonparametric methods.
Definitions and Exercises are from the Book (Daniel T. Larose 2015)
Refer Lecture: Data Pre-Processing. Refer Numerical Measures.
Please import the "B18-Churn.xlsx". Please import the "B16-Cars2.csv".
summary(cut_number(diamonds$depth, n = 27)) #Passed
summary(cut_number(diamonds$depth, n = 28)) #Failed (N = 53940)
# #n = 12
bb <- c(1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44)
#
# #Fixing Number of Bins: Unequal number of Observations and also Bins with 0 Observations
summary(cut(bb, breaks = 3))
## (0.957,15.3] (15.3,29.7] (29.7,44]
## 11 0 1
summary(cut_interval(bb, n = 3))
## [1,15.3] (15.3,29.7] (29.7,44]
## 11 0 1
#
# #For reference, NOT equivalent to above.
summary(cut_width(bb, width = 15))
## [-7.5,7.5] (7.5,22.5] (22.5,37.5] (37.5,52.5]
## 7 4 0 1
#
# #Using Equal Frequency: Same Observation may belong to different consecutive Bins
if(FALSE) ggplot2::cut_number(bb, n = 3) #ERROR
if(FALSE) { #Works
#ceiling(seq_along(bb)/4)[rank(bb, ties.method = "first")]
tibble(bb = bb, RANK = rank(bb, ties.method = "first"),
ALONG = seq_along(bb), CEIL = ceiling(seq_along(bb)/4))
}
#
# #dplyr::ntile() can be used in place of ggplot2::cut_number()
dplyr::ntile(bb, n = 3)
## [1] 1 1 1 1 2 2 2 2 3 3 3 3
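The rank-based approach sketched in the `if(FALSE)` block above can be verified directly; it reproduces the same grouping as `dplyr::ntile()`:

```r
# #Equal-frequency bins via tie-broken ranks (same grouping as ntile() above)
bb <- c(1, 1, 1, 1, 1, 2, 2, 11, 11, 12, 12, 44)
bins <- ceiling(seq_along(bb) / 4)[rank(bb, ties.method = "first")]
table(bins)  # 4 observations in each of the 3 bins
```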
#Caution: ID fields should be filtered out of the data mining algorithms, but should not be removed from the data. They are for easy identification of records, not for correlation.
mtcars %>% mutate(ID = row_number()) %>% relocate(ID) %>% slice(1:6L)
## ID mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Please import the "B18-Churn.xlsx".
| Key | Min | Max | SD | Mean | Median | Mode | Unique | isNA | Skewness | p_Shapiro | isNormal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| day_mins | 0 | 350.8 | 54.467 | 179.78 | 179.4 | 154 | 1667 | 0 | -0.0291 | 0.6401 | TRUE |
| day_calls | 0 | 165 | 20.069 | 100.44 | 101 | 102 | 119 | 0 | -0.1117 | 0.0003 | FALSE |
| day_charge | 0 | 59.6 | 9.259 | 30.56 | 30.5 | 26.18 | 1667 | 0 | -0.0291 | 0.6401 | TRUE |
| eve_mins | 0 | 363.7 | 50.714 | 200.98 | 201.4 | 169.9 | 1611 | 0 | -0.0239 | 0.7125 | TRUE |
| eve_calls | 0 | 170 | 19.923 | 100.11 | 100 | 105 | 123 | 0 | -0.0555 | 0.0088 | FALSE |
| eve_charge | 0 | 30.9 | 4.311 | 17.08 | 17.12 | 16.12 | 1440 | 0 | -0.0238 | 0.7091 | TRUE |
| night_mins | 23.2 | 395 | 50.574 | 200.87 | 201.2 | 188.2 | 1591 | 0 | 0.0089 | 0.627 | TRUE |
| night_calls | 33 | 175 | 19.569 | 100.11 | 100 | 105 | 120 | 0 | 0.0325 | 0.2514 | TRUE |
| night_charge | 1.04 | 17.8 | 2.276 | 9.04 | 9.05 | 9.66 | 933 | 0 | 0.0089 | 0.6238 | TRUE |
| intl_mins | 0 | 20 | 2.792 | 10.24 | 10.3 | 10 | 162 | 0 | -0.2449 | 0 | FALSE |
| intl_calls | 0 | 20 | 2.461 | 4.48 | 4 | 3 | 21 | 0 | 1.3203 | 0 | FALSE |
| intl_charge | 0 | 5.4 | 0.754 | 2.76 | 2.78 | 2.7 | 162 | 0 | -0.2451 | 0 | FALSE |
| custserv_calls | 0 | 9 | 1.315 | 1.56 | 1 | 1 | 10 | 0 | 1.0904 | 0 | FALSE |
| vmail_message | 0 | 51 | 13.688 | 8.1 | 0 | 0 | 46 | 0 | 1.2637 | 0 | FALSE |
| account_length | 1 | 243 | 39.822 | 101.06 | 101 | 105 | 212 | 0 | 0.0965 | 0.0012 | FALSE |
| Churn \(\rightarrow\) \(\downarrow\) International | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM % |
|---|---|---|---|---|---|---|---|---|
| No | 2664 | 346 | 3010 | 88.5% | 11.5% | 93.5% | 71.6% | 90.3% |
| Yes | 186 | 137 | 323 | 57.6% | 42.4% | 6.5% | 28.4% | 9.7% |
| Total | 2850 | 483 | 3333 | 85.5% | 14.5% | 100% | 100% | 100% |
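For comparison, a contingency table like the one above can be sketched with base R's `table()`, `addmargins()`, and `prop.table()`; the six rows here are hypothetical, while the real table is built from the `int_l_plan`/`churn` columns of `bb` in the pipeline below:

```r
# #Hypothetical mini sample; the real data come from bb (int_l_plan, churn)
dd <- data.frame(International = c("No", "No", "No", "No", "Yes", "Yes"),
                 Churn         = c("No", "Yes", "No", "No", "No", "Yes"))
tab <- table(dd$International, dd$Churn)
addmargins(tab)                              # row and column totals
round(100 * prop.table(tab, margin = 1), 1)  # row percentages
```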
Figure 41.1 International Plan holders tend to churn more frequently
# #xyw: x (independent), y (dependent), c (column =y), r (row =x)
# #g (grouped), w (wider), l (longer),
r_xyg <- "International"
c_xyg <- "Churn"
xyg <- bb %>% select(int_l_plan, churn) %>% rename(Predictor = 1, Target = 2) %>%
mutate(across(1:2, ~ifelse(., 'Yes', 'No'))) %>%
count(Predictor, Target)
#
str(xyg)
# #IN: xyg (Predictor, Target), r_xyg, c_xyg
# #Generate Contingency Table (CrossTab)
ctab <- xyg %>%
pivot_wider(names_from = Target, values_from = n, values_fill = 0, names_sort = TRUE) %>%
#mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~ "Total")), summarise(., across(where(is.numeric), sum))) %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric) & !matches("SUM"),
list(r = ~ round(. /SUM, 3)), .names = "xx_{.fn}{.col}")) %>%
mutate(across(where(is.numeric) & !starts_with("xx_") & !matches("SUM"),
list(rp = ~ paste0(round(100 * . /SUM, 1), "%")), .names = "{.fn}{.col}")) %>%
mutate(across(where(is.numeric) & !starts_with("xx_"),
list(c = ~ 2 * ./sum(.)), .names = "yy_{.fn}{.col}")) %>%
mutate(across(where(is.numeric) & !starts_with("xx_") & !starts_with("yy_"),
    list(cp = ~ paste0(round(100 * 2 * . /sum(.), 1), "%")), .names = "{.fn}{.col}"))
hh <- ctab %>% select(-c(5:6, 9:11))
#names_hh <- names(hh)
names_hh <-c(paste0(c_xyg, " ", "$\\rightarrow$", "<br/>", "$\\downarrow$", " ", r_xyg),
"No <br/> <br/>", "Yes <br/> <br/>", "Row <br/> SUM",
"Row <br/> No %", "Row <br/> Yes %",
"Col <br/> No %", "Col <br/> Yes %", "Col <br/> Row SUM %")
stopifnot(identical(ncol(hh), length(names_hh)))
stopifnot(nrow(hh) < 10)
#
cap_hh <- paste0("(C33T02) ", r_xyg, " vs. ", c_xyg)
#
kbl(hh,
caption = cap_hh,
col.names = names_hh,
escape = FALSE, align = "c", booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
html_font = "Consolas", font_size = 12,
full_width = FALSE,
#position = "float_left",
fixed_thead = TRUE
) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
row_spec(0, color = "white", background = "#303030", bold = TRUE,
extra_css = "border-bottom: 1px solid; border-top: 1px solid"
) #%>% row_spec(row = 1:nrow(hh), color = "black")
| Churn \(\rightarrow\) \(\downarrow\) VoiceMail | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM % |
|---|---|---|---|---|---|---|---|---|
| No | 2008 | 403 | 2411 | 83.3% | 16.7% | 70.5% | 83.4% | 72.3% |
| Yes | 842 | 80 | 922 | 91.3% | 8.7% | 29.5% | 16.6% | 27.7% |
| Total | 2850 | 483 | 3333 | 85.5% | 14.5% | 100% | 100% | 100% |
Figure 41.2 Churn vs. International & Voice Mail Plans (Both are Same Graphs)
Figure 41.3 Churn vs. International & Voice Mail Plans (Proportional Stack)
xsyg <- bb %>% select(int_l_plan, vmail_plan, churn) %>%
rename(Intl = 1, Vmail =2, Churn = 3) %>%
mutate(across(Churn, ~ifelse(., "Yes", "No"))) %>%
mutate(across(Intl, ~ifelse(., "I: Yes", "I: No"))) %>%
mutate(across(Vmail, ~ifelse(., "V: Yes", "V: No"))) %>%
  count(Intl, Vmail, Churn) %>% rename(N = n)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P04"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL
x_hh <- "Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Churn, y = N, fill = Churn)) +
geom_bar(position = "dodge", stat = "identity", alpha = 1) +
geom_text(position = position_stack(vjust = 0.5), aes(label = N)) +
facet_wrap(Intl ~ Vmail, nrow = 1) +
scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1),
legend.position = 'top',
legend.box = "horizontal", legend.direction = "horizontal") +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Multilayer Clustered Bar Chart
hh <- xsyg
#
cap_hh <- "C33P05"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL
x_hh <- NULL #"Churn" #r_xyg
y_hh <- "Frequency"
lgd_hh <- "Churn" #c_xyg
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) +
geom_bar(position = "dodge", stat = "identity", alpha = 1) +
geom_text(position = position_dodge(width = 1), aes(label = N), vjust = 1.5) +
facet_wrap(~Intl, nrow = 1) +
scale_fill_manual(values = c('#FFEA46FF', '#787877FF')) +
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1),
legend.position = 'top',
legend.box = "horizontal", legend.direction = "horizontal") +
labs(x = NULL, y = y_hh, fill = lgd_hh,
subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Percent Stacked Bar Chart
hh <- xsyg %>% group_by(Vmail, Intl) %>%
mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
cap_hh <- "C33P06"
ttl_hh <- "Churn: Churn vs. International & Voice Mail Plans"
sub_hh <- NULL
x_hh <- "Voice Mail"
y_hh <- "Grouped Percentage"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Vmail, y = N, fill = Churn)) +
geom_bar(position = "fill", stat = 'identity') +
facet_wrap(~Intl, nrow = 1) +
geom_text(position = position_fill(vjust = 0.5), aes(label = Ratio),
colour = rep(c("black", "white"), 4)) +
scale_fill_viridis_d(direction = -1) +
scale_y_continuous(labels = percent) +
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1)) +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
Figure 41.4 Churn: All Histograms
Figure 41.5 Service Calls beyond 3 have a significant increase in Churn
Figure 41.6 Churn vs. Service Call Proportions (Bar)
ii <- bb %>% select(custserv_calls, churn) %>%
rename(Churn = 2) %>%
mutate(across(Churn, ~ifelse(., "Yes", "No")))
# #Histogram Default (Not useful for association of predictor and target)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls"
cap_hh <- "C33P08"
sub_hh <- "Predictor Only"
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = '#FDE725FF')) +
geom_histogram(bins = length(unique(.[[1]])), alpha = 1) +
#stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..),
# geom="text", position=position_stack(vjust=0.5)) +
scale_x_continuous(breaks = breaks_pretty()) +
scale_fill_viridis_d(direction = -1) +
theme(plot.title.position = "panel",
axis.title.x = element_blank(),
#legend.position = c(0.5, -0.08), legend.direction = 'horizontal',
legend.position = 'none') +
labs(x = x_hh, y = y_hh, #fill = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Count)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P09"
sub_hh <- "Count of Predictor & Target"
x_hh <- "x"
y_hh <- "Frequency"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) +
geom_histogram(bins = length(unique(.[[1]])), alpha = 1) +
#stat_bin(bins = length(unique(.[[1]])), aes(y=..count.., label=..count..),
# geom="text", position=position_stack(vjust=0.5)) +
scale_x_continuous(breaks = breaks_pretty()) +
scale_fill_viridis_d(direction = -1) +
theme(plot.title.position = "panel",
axis.title.x = element_blank(),
legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram (Predictor and Target Proportion)
hh <- ii
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P10"
sub_hh <- "Proportion of Predictor & Target (Using Histogram)"
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh <- "Churn"
#
# #NOTES: Group wise Proportion i.e. Yes 1, No 1 (Not Each Bin wise)
# #Adding Density automatically converts the y-axis to 'frequency density' not to 'percentage'
# #Both will match when Bin Width =1 but otherwise there will be a mismatch
# #Using y=..density.. scales the histograms so the area under each is 1, or sum(binwidth*y)=1.
# #So use y = binwidth *..density.. to have y represent the fraction of the total in each bin.
# #OR aes(y = stat(width*density))
# #OR aes(y = stat(count / sum(count)))
# #Clarification
# # ..count../sum(..count..) each count is divided by the total count
# # ..density.. it is applied to each group independently
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = custserv_calls, fill = Churn)) +
geom_histogram(bins = length(unique(.[[1]])), alpha = 1, position = 'fill') +
#stat_bin(bins = length(unique(.[[1]])),
# aes(y = c(..count..[..group..==1]/sum(..count..[..group..==1]),
# ..count..[..group..==2]/sum(..count..[..group..==2])),
# label=round(..density.., 2)), geom="text") +
#stat_bin(bins = length(unique(.[[1]])),
# aes(y=..density.., group = ..group..,
# label=round(..density.., 2)), geom="text") +
scale_x_continuous(breaks = breaks_pretty()) +
#Deprecated : percent_format()
scale_y_continuous(labels = percent) +
scale_fill_viridis_d(direction = -1) +
theme(plot.title.position = "panel",
axis.title.x = element_blank(),
legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #complete() can be used to add the missing combination but not using it for now
# #NOTE: Bar is easier to include Labels, However Histogram is easier for large number of bins
xyg <- ii %>% count(custserv_calls, Churn)
#%>% complete(custserv_calls, Churn, fill = list(n = 0))
hh <- xyg %>% rename(Group = 1, SubGroup = 2, N = 3) %>%
group_by(Group) %>%
mutate(Ratio = paste0(round(100 * N/sum(N), 1), "%")) %>% ungroup()
#
ttl_hh <- "Churn: Customer Service Calls & Churn"
cap_hh <- "C33P11"
sub_hh <- "Proportion of Predictor & Target (Using Bar)"
x_hh <- "x"
y_hh <- "Grouped Percentage"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(., aes(x = Group, y = N, fill = SubGroup)) +
geom_bar(position = "fill", stat = 'identity') +
geom_text(position = position_fill(vjust = 0.5),
aes(label = ifelse(SubGroup == 'No', "", Ratio)),
colour = c(rep(c("black", "white"), 9), "white")) +
scale_fill_viridis_d(direction = -1) +
scale_x_continuous(breaks = breaks_pretty()) +
scale_y_continuous(labels = percent) +
theme(plot.title.position = "panel", axis.title.x = element_blank(),
panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1),
legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
subtitle = sub_hh, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
Figure 41.7 Higher Day Minutes (>200) have higher Churn
Figure 41.8 (Inconclusive) Slight tendency to Churn with higher Evening Minutes
Figure 41.9 No obvious association between Churn and Night Minutes
Figure 41.10 No obvious association between Churn and International Calls (but the t-test below detects a small difference in means)
# #t-test for difference in Mean of Target Values
str(ii)
## tibble [3,333 x 2] (S3: tbl_df/tbl/data.frame)
## $ Predictor: int [1:3333] 3 3 5 7 3 6 7 6 4 5 ...
## $ Target : chr [1:3333] "No" "No" "No" "No" ...
#
# #Variance
var_ii <- var.test(formula = Predictor ~ Target, data = ii)
var_ii
##
## F test to compare two variances
##
## data: Predictor by Target
## F = 0.91594, num df = 2849, denom df = 482, p-value = 0.1971
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7962798 1.0465337
## sample estimates:
## ratio of variances
## 0.9159437
#
isVarEqual <- ifelse(var_ii$p.value > 0.05, TRUE, FALSE)
if(isVarEqual) print("Variances are Equal.") else print("Variances are Different.")
## [1] "Variances are Equal."
#
# #t-test: Welch
ha_ii <- "two.sided" #"two.sided", "less", "greater"
#tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = isVarEqual)
tt_ii <- t.test(Predictor ~ Target, data = ii, alternative = ha_ii, var.equal = FALSE)
tt_ii
##
## Welch Two Sample t-test
##
## data: Predictor by Target
## t = 2.9604, df = 640.64, p-value = 0.003186
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 0.1243807 0.6144620
## sample estimates:
## mean in group No mean in group Yes
## 4.532982 4.163561
#
alpha <- 0.05
# #t.test() reports a p-value already matched to the chosen alternative
# #(two-sided p-values are pre-doubled), so compare it directly to alpha
if(tt_ii$p.value >= alpha) {
  print("Failed to reject H0.")
} else {
  print("H0 Rejected.")
}
## [1] "H0 Rejected."
Figure 41.11 Histograms of All Predictors with Target (Count)
Figure 41.12 Histograms of All Predictors with Target (Proportion)
# #All Continuous Predictors and Target (Categorical)
xsy <- bb %>%
select(where(is.numeric) | "churn") %>%
select(!area_code) %>%
relocate(ends_with("_mins")) %>%
relocate(ends_with("_calls")) %>%
relocate(vmail_message, .after = last_col()) %>%
relocate("churn") %>% rename("Target" = 1) %>%
mutate(across(Target, ~ifelse(., "Yes", "No"))) %>%
mutate(across(Target, factor, levels = unique(Target)))
#
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>%
mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
#
#str(ii)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Count)"
cap_hh <- "C33P18"
sub_hh <- NULL #"Count of Predictor & Target"
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) +
geom_histogram(alpha = 1, boundary = 0, #position = position_stack(reverse = TRUE),
bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) +
#geom_histogram(alpha = 1, boundary = 0,
# bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) +
facet_wrap(~Predictors, nrow = 3, scales = 'free') +
scale_x_continuous(breaks = breaks_pretty()) +
scale_fill_viridis_d(direction = -1) +
theme(plot.title.position = "panel",
strip.text.x = element_text(size = 10, colour = "white"),
axis.title.x = element_blank(),
legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #Histogram
hh <- xsyl
ttl_hh <- "Churn: Histograms of All Predictors with Target (Proportion)"
cap_hh <- "C33P19"
sub_hh <- NULL #"Count of Predictor & Target"
x_hh <- NULL # "x"
y_hh <- NULL # "Frequency" #"Grouped Percentage"
lgd_hh <- "Churn"
#
# #Caution: Warning on Removal of Missing Values has been removed
# #NOTE: position_fill() normalizes the Bars and is same as 'fill'
# #
C33 <- hh %>% { ggplot(data = ., mapping = aes(x = Values, fill = Target)) +
geom_histogram(alpha = 1, boundary = 0, position = position_fill(), na.rm = TRUE,
bins = ifelse(length(unique(.$Predictors)) > 50, 50, length(unique(.$Predictors)))) +
#geom_histogram(alpha = 1, boundary = 0,
# bins = ifelse(nrow(distinct(.[2])) > 50, 50, nrow(distinct(.[2])))) +
facet_wrap(~Predictors, nrow = 3, scales = 'free_x') +
scale_x_continuous(breaks = breaks_pretty()) +
scale_fill_viridis_d(direction = -1) +
theme(plot.title.position = "panel",
strip.text.x = element_text(size = 10, colour = "white"),
axis.title.x = element_blank(),
legend.position = c(0.5, -0.07), legend.direction = 'horizontal') +
labs(x = x_hh, y = y_hh, fill = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
Figure 41.13 Scatterplot of Evening Minutes (X) and Day Minutes (Y) shows a clear separation for Churn at High X and High Y
ii <- bb %>% select(day_mins, eve_mins, churn) %>%
rename(Target = churn) %>%
mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Scatterplot of Evening Minutes and Day Minutes"
cap_hh <- "C33P22"
sub_hh <- NULL #subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)")
x_hh <- "Evening Minutes" # "x"
y_hh <- "Day Minutes" # "Frequency" #"Grouped Percentage"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = eve_mins, y = day_mins)) +
#geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
#geom_point(position = "jitter", aes(colour = Target)) +
geom_jitter(aes(colour = Target)) +
scale_colour_viridis_d(alpha = 0.9, direction = -1) +
theme(panel.grid.minor = element_blank(),
panel.border = element_blank()) +
labs(x = x_hh, y = y_hh, colour = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
Figure 41.14 Scatterplot of Day Minutes (X) and Customer Service Calls (Y) shows an interaction effect for Churn
ii <- bb %>% select(day_mins, custserv_calls, churn) %>%
rename(Target = churn) %>%
mutate(across(Target, ~ifelse(., "Yes", "No")))
hh <- ii
ttl_hh <- "Churn: Day Minutes and Customer Service Calls"
cap_hh <- "C33P27"
sub_hh <- NULL
x_hh <- "Day Minutes"
y_hh <- "Customer Service Calls"
lgd_hh <- "Churn"
#
C33 <- hh %>% { ggplot(data = ., aes(x = day_mins, y = custserv_calls)) +
geom_jitter(aes(colour = Target), width = 0.1, height = 0.1) +
scale_colour_viridis_d(alpha = 0.9, direction = -1) +
scale_y_continuous(breaks = breaks_pretty()) +
theme(panel.grid.minor = element_blank(),
panel.border = element_blank()) +
labs(x = x_hh, y = y_hh, colour = lgd_hh,
caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
Figure 41.15 Bivariate Analysis of Calls showing Churn
Figure 41.16 Bivariate Analysis of Minutes & Charges showing Churn
# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% {
ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3),
columns = 2:ncol(.),
lower = list(continuous = f_gg_scatter),
diag = list(continuous = f_gg_density)) +
labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #For GGally Manual Functions
f_gg_scatter <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_jitter(...) +
scale_colour_viridis_d(direction = -1)
}
f_gg_density <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density(...) +
scale_fill_viridis_d(direction = -1) +
scale_colour_viridis_d(direction = -1)
}
Figure 41.17 Bivariate Analysis of Calls showing Churn
Figure 41.18 Bivariate Analysis of Minutes & Charges showing Churn
# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144)
pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE,
bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1,
gap = 0, main = ttl_hh)
#title(main = ttl_hh, line = 2, adj = 0)
title(sub = cap_hh, line = 4, adj = 1)
C33 <- recordPlot()
dev.off()
assign(cap_hh, C33)
rm(C33)
}
Figure 41.19 High Customer Service Calls have major Churn
| Churn \(\rightarrow\) \(\downarrow\) Service Calls | No | Yes | Row SUM | Row No % | Row Yes % | Col No % | Col Yes % | Col Row SUM % |
|---|---|---|---|---|---|---|---|---|
| Low | 2721 | 345 | 3066 | 88.7% | 11.3% | 95.5% | 71.4% | 92% |
| High | 129 | 138 | 267 | 48.3% | 51.7% | 4.5% | 28.6% | 8% |
| Total | 2850 | 483 | 3333 | 85.5% | 14.5% | 100% | 100% | 100% |
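The Low/High grouping in the table above is presumably a binning of `custserv_calls`; a minimal sketch, assuming the cut is at 3 calls (a threshold inferred from Figure 41.5, not stated explicitly here):

```r
# #Assumed threshold: calls <= 3 -> "Low", calls > 3 -> "High"
calls <- c(0, 1, 2, 3, 4, 5, 9)
grp <- cut(calls, breaks = c(-Inf, 3, Inf), labels = c("Low", "High"))
table(grp)
```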
# #All Continuous Predictors and Target (Categorical)
# #Select Relevant | Binning | Rename & Relocate | Relabel & Factor Target|
xsy <- bb %>% select( (where(is.numeric) & ! "area_code") | "churn") %>%
relocate(ends_with("_mins")) %>%
relocate(ends_with("_calls")) %>%
relocate(vmail_message, .after = last_col()) %>%
relocate(Target = "churn") %>%
mutate(across(Target, ~ifelse(., "Yes", "No"))) %>%
mutate(across(Target, factor, levels = unique(Target)))
# #
xsyl <- xsy %>% pivot_longer(where(is.numeric), names_to = "Predictors", values_to = "Values") %>%
mutate(across(Predictors, ~ factor(., levels = unique(Predictors))))
# #
# #Select All Minutes and All Charges (8)
hh <- xsy %>% select(1, 7:10, 12:15)
#
ttl_hh <- "GGally: Churn: SPLOM: Minutes and Charges (8)"
cap_hh <- "C33P31"
sub_hh <- NULL #"Count of Predictor & Target"
lgd_hh <- "Churn"
# #Assumes Column 1 has Target Variable (Factor)
C33 <- hh %>% {
ggpairs(data = ., mapping = aes(colour = Target, fill = Target, alpha = 0.3),
columns = 2:ncol(.),
lower = list(continuous = f_gg_scatter),
diag = list(continuous = f_gg_density)) +
labs(caption = cap_hh, subtitle = sub_hh, title = ttl_hh)
}
assign(cap_hh, C33)
rm(C33)
# #IN: hh, cap_hh, ttl_hh, loc_png
if(!file.exists(loc_png)) {
png(filename = loc_png, width = k_width, height = k_height, units = "in", res = 144)
pairs.panels(hh[2:ncol(hh)], smooth = FALSE, jiggle = TRUE, rug = FALSE, ellipses = FALSE,
bg = rev(viridis(2))[hh$Target], pch = 21, lwd = 1, cex.cor = 1, cex = 1,
gap = 0, main = ttl_hh)
#title(main = ttl_hh, line = 2, adj = 0)
title(sub = cap_hh, line = 4, adj = 1)
C33 <- recordPlot()
dev.off()
assign(cap_hh, C33)
rm(C33)
}
# #For Reference Only. Only adds a Histogram at the Diagonal to the Base pairs()
PerformanceAnalytics::chart.Correlation(hh[2:ncol(hh)], histogram = TRUE)
Figure 41.20 corrplot::corrplot() vs. psych::corPlot()
10 or 11 Dimensions are enough for the Universe. How many are needed for your data?
Please import the "C34-cadata.txt".
set.seed(3) #wwww
# #sample() and its variants can be used for Partitioning of Dataset
if(FALSE) {
# #This approach is difficult to extend for 3 or more splits
# #Further, Using floor() multiple times might result in loss of a row
train_idx <- sample.int(n = nrow(bb), size = floor(0.8 * nrow(bb)), replace = FALSE)
train_idx <- sample(seq_len(nrow(bb)), size = floor(0.8 * nrow(bb)), replace = FALSE)
train_bb <- bb[train_idx, ]
test_bb <- bb[-train_idx, ]
}
#
# #For 3 or more splits
#brk_bb = c(train = 0.8, test = 0.1, validate = 0.1)
brk_bb = c(train = 0.9, test = 0.1)
idx_bb = sample(cut(seq_len(nrow(bb)), nrow(bb) * cumsum(c(0, brk_bb)), labels = names(brk_bb)))
#
# #Splits by Fixed Numbers not Percentages
if(FALSE) {
brk_bb = c(train = 18570, test = nrow(bb))
idx_bb = sample(cut(seq_len(nrow(bb)), c(0, brk_bb), labels = names(brk_bb)))
}
# #List of Multiple Tibbles
part_l = split(bb, idx_bb)
#
# #nrow(), ncol(), dim() can be applied
vapply(part_l, nrow, FUN.VALUE = integer(1))
## train test
## 18576 2064
stopifnot(identical(nrow(bb), sum(vapply(part_l, nrow, FUN.VALUE = integer(1)))))
| Keys | Min | Max | SD | Mean | Median | Mode | Unique | isNA |
|---|---|---|---|---|---|---|---|---|
| House Value (Median) | 14999 | 500001 | 115304.56 | 206928.9 | 179700 | 500001 | 3767 | 0 |
| Income (Median) | 0.5 | 15 | 1.9 | 3.87 | 3.53 | 15 | 11952 | 0 |
| House Age (Median) | 1 | 52 | 12.57 | 28.6 | 29 | 52 | 52 | 0 |
| Rooms (Total) | 2 | 39320 | 2191.95 | 2641.99 | 2127 | 1613 | 5733 | 0 |
| Bedrooms (Total) | 1 | 6445 | 425.22 | 539.53 | 435 | 280 | 1886 | 0 |
| Population | 3 | 35682 | 1140.33 | 1427.82 | 1166 | 850 | 3781 | 0 |
| Households | 1 | 6082 | 385.55 | 501.02 | 410 | 306 | 1773 | 0 |
| Latitude | 32.5 | 42 | 2.13 | 35.62 | 34.25 | 34.1 | 851 | 0 |
| Longitude | -124.3 | -114 | 2 | -119.56 | -118.49 | -118.3 | 832 | 0 |
Figure 42.1 Houses: Boxplot (Scaled)
Note that normality of the data is not strictly required to perform non-inferential PCA but that strong departures from normality may diminish the observed correlations. As data mining applications usually do not involve inference, we will not worry about normality.
Figure 42.2 Houses: SPLOM (8)
Figure 42.3 Houses: Corrplot (8)
| income | h_age | rooms | bedrooms | population | households | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|
| income | ||||||||
| h_age | -0.117 | |||||||
| rooms | 0.193 | -0.36 | ||||||
| bedrooms | -0.013 | -0.32 | 0.93 | |||||
| population | 0.001 | -0.29 | 0.86 | 0.88 | ||||
| households | 0.008 | -0.3 | 0.92 | 0.98 | 0.9 | |||
| latitude | -0.078 | 0.01 | -0.04 | -0.07 | -0.1 | -0.07 | ||
| longitude | -0.017 | -0.11 | 0.05 | 0.07 | 0.1 | 0.06 | -0.9 |
# #cor() produces a Matrix
ii <- cor(zw)
str(ii)
## num [1:8, 1:8] 1 -0.11702 0.192888 -0.012576 0.000955 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
## ..$ : chr [1:8] "income" "h_age" "rooms" "bedrooms" ...
ii
## income h_age rooms bedrooms population households latitude
## income 1.000000000 -0.11701992 0.19288773 -0.01257616 0.000954879 0.008423239 -0.07844076
## h_age -0.117019916 1.00000000 -0.35816935 -0.31683614 -0.292810501 -0.299390249 0.01077015
## rooms 0.192887732 -0.35816935 1.00000000 0.93037756 0.857256554 0.919196762 -0.03702508
## bedrooms -0.012576164 -0.31683614 0.93037756 1.00000000 0.877848495 0.980039774 -0.06683883
## population 0.000954879 -0.29281050 0.85725655 0.87784849 1.000000000 0.907065416 -0.10872704
## households 0.008423239 -0.29939025 0.91919676 0.98003977 0.907065416 1.000000000 -0.07121960
## latitude -0.078440760 0.01077015 -0.03702508 -0.06683883 -0.108727036 -0.071219604 1.00000000
## longitude -0.016593996 -0.10712695 0.04526591 0.06873347 0.100075888 0.055907965 -0.92508077
## longitude
## income -0.01659400
## h_age -0.10712695
## rooms 0.04526591
## bedrooms 0.06873347
## population 0.10007589
## households 0.05590797
## latitude -0.92508077
## longitude 1.00000000
#
# #We can eliminate Lower Triangle and Diagonal. However NA does not print well with format()
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
#
# #Take advantage of Matrix Triangle and Set to 0 for later handling by format()
# #IF we remove the diagonal then dimensions get haywire i.e. 8 to 7 columns left [28] elements
# #IF we keep the diagonal then dimensions get haywire i.e. 8 to 9 columns left [36] elements
# #So, cannot use NA, has to use ZERO (So that, later, format can replace it.)
#
# #However, we finally went ahead with the dplyr solution, which handles NA separately from format()
# #Thus eliminating the need to assign 0; NA is used for the redundant triangle and diagonal
#
kk <- ii #ii is FULL 8x8 Matrix
kk[upper.tri(kk, diag = TRUE)] <- NA
mm <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>% drop_na() %>%
filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
#
kk <- ii #ii is FULL 8x8 Matrix
nn <- kk %>% as.table() %>% as_tibble(.name_repair = 'unique') %>%
filter(...1 != ...2) %>%
filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)),
pmin(as.character(...1), as.character(...2))))) %>%
filter(n > 0.5 | n < -0.5) %>% arrange(desc(abs(n)))
stopifnot(identical(mm, nn))
#
# #However, above are long because of the usage of as.table(). For Wide:
ll <- kk %>% #as.table() %>%
as_tibble() %>%
mutate(ID = row_number())
# #Get Names
oo <- names(ll)
ll %>% mutate(across(-ID, ~ ifelse(ID <= match(cur_column(), oo), NA, .x))) %>% select(-ID)
## # A tibble: 8 x 8
## income h_age rooms bedrooms population households latitude longitude
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 NA NA NA NA NA NA NA NA
## 2 -0.117 NA NA NA NA NA NA NA
## 3 0.193 -0.358 NA NA NA NA NA NA
## 4 -0.0126 -0.317 0.930 NA NA NA NA NA
## 5 0.000955 -0.293 0.857 0.878 NA NA NA NA
## 6 0.00842 -0.299 0.919 0.980 0.907 NA NA NA
## 7 -0.0784 0.0108 -0.0370 -0.0668 -0.109 -0.0712 NA NA
## 8 -0.0166 -0.107 0.0453 0.0687 0.100 0.0559 -0.925 NA
f_pKblM <- function(x, caption, isTri = TRUE, negPos = c(-0.0000001, 0.0000001), dig = 1L, ...) {
# #Description:
# Prints Kable Matrix Standard Format: f_pKblM(hh, cap_hh)
# Calls: f_pKbl()
# #Arguments:
# x: Matrix
# caption: Table Title with Table Number in "(AXXTYY)" Format
# isTri: When TRUE (Default) prints Lower Triangle Only otherwise complete Matrix
# negPos: Vector of 2 values, to apply 3 colours to labels
# dig: Number of decimal places
# ... : Everything else is passed to f_pKbl()
#
stopifnot(identical(length(negPos), 2L))
#
# #outcome of upper.tri() is easily compared to as.table(). lower.tri() will need extra step
if(isTri) x[upper.tri(x, diag = TRUE)] <- NA
#
# #Suppress Warnings because 1 column is completely NA on which mutate(across()) is applied
# #Keeping the column is better to be seen as Matrix in this specific case of Correlation Matrix
# #Warning messages: no non-missing arguments to min; returning Inf
# #Warning messages: no non-missing arguments to max; returning -Inf
#
x <- suppressWarnings(x %>%
# #Using as.table() gives long, otherwise wide
#as.table() %>%
as_tibble(rownames = NA, .name_repair = 'unique') %>%
# #Value based conditional formatting needs to happen before kbl() is called because
# #mutate() does not work on kbl
# #format() needs to be called inside cell_spec() itself
# #format cannot be applied later because once the value is modified for kbl() it becomes numeric
# #format cannot be applied before because it changes the value to character
mutate(across(everything(),
~ cell_spec(ifelse(is.na(.x), "",
format(.x, digits = dig, scientific = FALSE, drop0trailing = TRUE)),
# #Change na_font_size to 1 or higher number to see bigger visual blobs on NA
font_size = spec_font_size(abs(.x), na_font_size = 0),
color = ifelse(.x < 0 | is.na(.x), "black", "black"),
background = case_when(is.na(.x) ~ "black",
.x < negPos[1] ~ "#D8B365",
.x >= negPos[2] ~ "#5AB4AC",
TRUE ~ "grey")))))
result <- f_pKbl(x, caption = caption, ...)
return(result)
# #xxCLOSE: f_pKblM()
}
f_pKbl <- function(x, caption, headers = names(x), debug = FALSE, maxrows = 30L) {
# #Print Kable Standard Formats: f_pKbl(hh, cap_hh, headers = names_hh, debug = TRUE)
# #Kable Prints FULL DATASET passed to it.
# #names() does NOT work on Matrices but colnames() works
# #even though names() is superior to colnames() in all other aspects
# #We can do a conditional check on type and then call the relevant function but for now
# #Supply colnames() manually if using Matrices
#
if(nrow(x) > maxrows) {
#Print only the Head of Big Datasets by checking if it has more rows than maxrows
x <- head(x)
}
txt_colour <- ifelse(debug, "black", "white")
result <- kbl(x,
                caption = caption,
col.names = headers,
escape = FALSE, align = "c", booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
html_font = "Consolas", font_size = 12,
full_width = FALSE,
#position = "float_left",
fixed_thead = TRUE
) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
row_spec(0, color = "white", background = "#303030", bold = TRUE,
extra_css = "border-bottom: 1px solid; border-top: 1px solid"
) %>% row_spec(row = 1:nrow(x), color = txt_colour)
return(result)
# #xxCLOSE: f_pKbl()
}
# #Perform PCA by prcomp() #wwww
#ii <- princomp(zw)
pca_zw <- prcomp(zw)
#
names(pca_zw)
## [1] "sdev" "rotation" "center" "scale" "x"
#
# #Principal components have "loadings" i.e. $rotation and "scores" i.e. $x
# #Loadings specify the weight that each variable contributes to the principal component.
# #Scores show the value each sample has on each principal component.
#
dim(pca_zw$rotation)
## [1] 8 8
dim(pca_zw$x)
## [1] 18576 8
#
# #Matrix Multiplication i.e. %*% of original variables with loadings gives scores
bb <- as.matrix(zw) %*% pca_zw$rotation
all.equal(bb, pca_zw$x)
## [1] TRUE
identical(round(bb, 5), round(pca_zw$x, 5))
## [1] TRUE
#
summary(pca_zw)$importance
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.975794 1.381294 1.035965 0.9078472 0.3852156 0.2846245 0.2162878 0.1211411
## Proportion of Variance 0.487970 0.238500 0.134150 0.1030200 0.0185500 0.0101300 0.0058500 0.0018300
## Cumulative Proportion 0.487970 0.726470 0.860620 0.9636400 0.9821900 0.9923200 0.9981700 1.0000000
#
pca_eigen <- summary(pca_zw)$importance %>% t() %>% as_tibble(rownames = "PCA") %>%
rename(SD = 2, pVar = 3, pVarCum = 4) %>%
mutate(EigenVal = SD^2, pVarManual = EigenVal/sum(EigenVal),
isOne = ifelse(EigenVal > 1, "Yes", "No"),
         isNinty = ifelse(pVarCum < 0.9, "Yes", "No"))
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|---|
| Income (Median) | -0.04 | 0.03 | -0.89 | 0.41 | 0.06 | -0.06 | -0.17 | -0.041 |
| House Age (Median) | 0.22 | -0.02 | 0.4 | 0.88 | -0.03 | 0.09 | -0.04 | -0.004 |
| Rooms (Total) | -0.48 | -0.07 | -0.09 | 0.11 | -0.32 | 0.56 | 0.55 | 0.152 |
| Bedrooms (Total) | -0.49 | -0.06 | 0.12 | 0.06 | -0.38 | -0.23 | -0.22 | -0.703 |
| Population | -0.47 | -0.03 | 0.11 | 0.08 | 0.85 | 0.13 | -0.02 | -0.133 |
| Households | -0.49 | -0.06 | 0.11 | 0.1 | -0.14 | -0.4 | -0.31 | 0.678 |
| Latitude | 0.07 | -0.7 | -0.01 | -0.1 | -0.05 | 0.47 | -0.52 | 0.035 |
| Longitude | -0.08 | 0.7 | 0.05 | -0.07 | -0.1 | 0.48 | -0.5 | 0.048 |
#wwww
# #Component Matrix of prcomp() does not match with either BOOK or psych::principal()
# #prcomp() rotation contains eigenvectors not loadings.
# #Loadings = Eigenvectors * sqrt(Eigenvalues) = Eigenvectors * sdev
#
# #psych::principal()
psy_zw <- principal(zw, nfactors = ncol(zw), rotate = 'none', scores = TRUE)
names(psy_zw)
## [1] "values" "rotation" "n.obs" "communality" "loadings" "fit"
## [7] "fit.off" "fn" "Call" "uniquenesses" "complexity" "chi"
## [13] "EPVAL" "R2" "objective" "residual" "rms" "factors"
## [19] "dof" "null.dof" "null.model" "criteria" "STATISTIC" "PVAL"
## [25] "weights" "r.scores" "Vaccounted" "Structure" "scores"
#
#psy_zw$loadings
#
# #To Match them Multiply by SD = sqrt(Eigenvalues)
sd_pca <- summary(pca_zw)$sdev
eigen_pca <- sd_pca ^ 2
#
# #Multiply PC1 column with sqrt(Eigenvalue) of PC1 i.e. SD and so on
load_pca <- t(t(pca_zw$rotation) * sd_pca)
#
round(load_pca, 3)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## income -0.083 0.047 -0.921 0.374 0.021 -0.016 -0.036 -0.005
## h_age 0.428 -0.021 0.413 0.803 -0.013 0.026 -0.009 0.000
## rooms -0.956 -0.103 -0.097 0.104 -0.122 0.159 0.119 0.018
## bedrooms -0.970 -0.084 0.120 0.057 -0.145 -0.066 -0.048 -0.085
## population -0.933 -0.036 0.118 0.074 0.327 0.036 -0.004 -0.016
## households -0.972 -0.088 0.112 0.087 -0.054 -0.114 -0.066 0.082
## latitude 0.145 -0.970 -0.012 -0.089 -0.018 0.133 -0.113 0.004
## longitude -0.150 0.969 0.057 -0.063 -0.037 0.137 -0.109 0.006
round(psy_zw$loadings, 3)
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## income 0.921 0.374
## h_age -0.428 -0.413 0.803
## rooms 0.956 0.103 0.104 0.122 0.159 -0.119
## bedrooms 0.970 -0.120 0.145
## population 0.933 -0.118 -0.327
## households 0.972 -0.112 -0.114
## latitude -0.145 0.970 0.133 0.113
## longitude 0.150 -0.969 0.137 0.109
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## SS loadings 3.904 1.909 1.072 0.824 0.148 0.081 0.047 0.015
## Proportion Var 0.488 0.239 0.134 0.103 0.019 0.010 0.006 0.002
## Cumulative Var 0.488 0.727 0.861 0.964 0.982 0.992 0.998 1.000
Figure 42.4 Houses: PCA Corrplot - ALL are ZERO
# #Correlation Matrix | Table (Long) | Tibble
hh <- cor(pca_zw$x) %>% as.table() %>% as_tibble(.name_repair = 'unique') %>%
#filter(...1 != ...2) %>%
filter(!duplicated(paste0(pmax(as.character(...1), as.character(...2)),
pmin(as.character(...1), as.character(...2))))) #%>%
#mutate(across(where(is.character), factor, levels = c_zsyw[-1], labels = names(c_zsyw)[-1]))
#
ttl_hh <- "Houses: PCA Corrplot - ALL are ZERO"
cap_hh <- "C34P04"
Figure 42.5 House: PCA Screeplot with Variance
| PCA | SD | exp_Variance | cum_Var | EigenVal | isEigenOne | isVarNinty |
|---|---|---|---|---|---|---|
| PC1 | 1.976 | 0.48797 | 0.488 | 3.9038 | Yes | Yes |
| PC2 | 1.381 | 0.2385 | 0.726 | 1.908 | Yes | Yes |
| PC3 | 1.036 | 0.13415 | 0.861 | 1.0732 | Yes | Yes |
| PC4 | 0.908 | 0.10302 | 0.964 | 0.8242 | No | No |
| PC5 | 0.385 | 0.01855 | 0.982 | 0.1484 | No | No |
| PC6 | 0.285 | 0.01013 | 0.992 | 0.081 | No | No |
| PC7 | 0.216 | 0.00585 | 0.998 | 0.0468 | No | No |
| PC8 | 0.121 | 0.00183 | 1 | 0.0147 | No | No |
| Total | NA | NA | NA | 8 | NA | NA |
hh <- pca_eigen
#
ttl_hh <- "Houses: PCA Eigenvalue ScreePlot"
cap_hh <- "C34P05"
y_hh <- "Eigenvalue"
# #IN: hh
C34 <- hh %>% {ggplot(., aes(x = PCA, y = EigenVal)) +
geom_point(aes(color = isOne), size = 3) +
geom_line(aes(group = 1)) +
geom_hline(aes(yintercept = 1), color = '#440154FF', linetype = "dashed") +
annotate("segment", x = 3.5, xend = 3.1, y = 1.6,
yend = 1.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate("segment", x = 4.5, xend = 4.1, y = 1.3,
yend = 0.9, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
annotate("segment", x = 5.5, xend = 5.1, y = 0.6,
yend = 0.2, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
geom_text(data = tibble(x = c(3.5, 4.5, 5.5), y = c(1.7, 1.4, 0.7),
labels = c("Eigenvalue Criterion", "Screeplot Criterion", "Elbow Point")),
aes(x=x, y=y, label=labels), check_overlap = TRUE) +
scale_fill_distiller(palette = "BrBG") +
#coord_fixed() +
theme(legend.position = 'none') +
labs(y = y_hh, subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
ttl_hh <- "Houses: PCA Proportion of Variance Explained"
cap_hh <- "C34P06"
y_hh <- NULL
# #IN: hh
C34 <- hh %>% {ggplot(., aes(x = PCA, y = pVarCum)) +
geom_point(aes(color = isNinty), size = 3) +
geom_line(aes(group = 1)) +
geom_hline(aes(yintercept = 0.9), color = '#440154FF', linetype = "dashed") +
annotate("segment", x = 4, xend = 4, y = 0.83,
yend = 0.93, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
geom_text(data = tibble(x = 5.5, y = 0.8, labels = c("Proportion of Variance Covered Criterion")),
aes(x=x, y=y, label=labels), check_overlap = TRUE) +
scale_fill_distiller(palette = "BrBG") +
scale_y_continuous(limits = c(0, 1), labels=percent) +
theme(legend.position = 'none') +
labs(y = y_hh,
subtitle = NULL, caption = cap_hh, title = ttl_hh)
}
assign(cap_hh, C34)
rm(C34)
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Income | -0.889 | 0.412 | ||
| Age | 0.216 | 0.399 | 0.885 | |
| Rooms | -0.484 | |||
| Beds | -0.491 | |||
| Pop | -0.472 | |||
| Houses | -0.492 | |||
| Lat | -0.702 | |||
| Long | 0.701 |
Figure 42.6 Correlation Matrices of Factor Scores of PC3 and PC4
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| Income | -0.921 | |||
| Age | 0.803 | |||
| Rooms | -0.956 | |||
| Beds | -0.97 | |||
| Pop | -0.933 | |||
| Houses | -0.972 | |||
| Lat | -0.97 | |||
| Long | 0.969 |
# #Create a Matrix
# #Default: The Matrix is Filled Column by Column
matrix(1:9, nrow = 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
#
# #Change Behaviour to fill the matrix by Row
matrix(1:9, nrow = 3, byrow = TRUE)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
# #names() does not work on Matrix
mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
## a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
vv <- 1:3
#
# #Matrix Multiply (Deprecated)
# #Multiply the Vector with Matrix Rows i.e. x * 1, y * 2, z * 3
ii <- diag(vv) %*% mm #loss of rownames because it is taken from Left Side Matrix
#
# #Multiply the Vector with Matrix Columns i.e. a * 1, b * 2, c * 3
jj <- mm %*% diag(vv) #loss of colnames because it is taken from Right Side Matrix
#
# #Check Attributes
attributes(ii)$dimnames
## [[1]]
## NULL
##
## [[2]]
## [1] "a" "b" "c"
attributes(jj)$dimnames
## [[1]]
## [1] "x" "y" "z"
##
## [[2]]
## NULL
#
# #Add Missing RowNames or ColNames
rownames(ii) <- rownames(mm)
colnames(jj) <- colnames(mm)
#
# #Coercing by as.integer() will produce a vector not matrix. Use either mode() or class()
#ii[] <- as.integer(ii) #Even using [] is NOT coercing to integer
class(ii) <- "integer"
mode(jj) <- "integer"
#
# #Print
ii
## a b c
## x 1 2 3
## y 8 10 12
## z 21 24 27
str(ii)
## int [1:3, 1:3] 1 8 21 2 10 24 3 12 27
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "x" "y" "z"
## ..$ : chr [1:3] "a" "b" "c"
jj
## a b c
## x 1 4 9
## y 4 10 18
## z 7 16 27
str(jj)
## int [1:3, 1:3] 1 4 7 4 10 16 9 18 27
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "x" "y" "z"
## ..$ : chr [1:3] "a" "b" "c"
#
# #Equivalent: sweep() (Deprecated). SLOW; however, it keeps the dimnames.
swp_ii <- sweep(mm, MARGIN = 1, vv, `*`)
swp_jj <- sweep(mm, MARGIN = 2, vv, `*`)
#
# #Recommended:
# #Equivalent: R Recycle Vector Column-wise. So double-transpose is needed if multiplying on jj.
# #Double Transpose is FASTEST & retains dimnames. Bonus: This is Commutative.
rec_ii <- mm * vv
com_ii <- vv * mm
rec_jj <- t(t(mm) * vv)
com_jj <- t(vv * t(mm))
all(identical(rec_ii, com_ii), identical(rec_jj, com_jj)) #Commutative
## [1] TRUE
#
all(identical(ii, swp_ii), identical(jj, swp_jj), identical(ii, rec_ii), identical(jj, rec_jj))
## [1] TRUE
mm <- matrix(1:9, ncol = 3, byrow = TRUE)
rownames(mm) <- tail(letters, 3)
colnames(mm) <- head(letters, 3)
mm
## a b c
## x 1 2 3
## y 4 5 6
## z 7 8 9
#
# #Square Each Element of the Matrix
mm ** 2 # the ** operator is obscure and is actually parsed as ^, so use ^ instead
## a b c
## x 1 4 9
## y 16 25 36
## z 49 64 81
mm ^ 2
## a b c
## x 1 4 9
## y 16 25 36
## z 49 64 81
stopifnot(identical(mm ^ 2, mm ** 2))
| | PC1 | PC2 | PC3 | PC4 | Comm_PC3 | Comm_PC4 |
|---|---|---|---|---|---|---|
| Income | -0.083 | 0.047 | -0.921 | 0.374 | 0.86 | 1 |
| Age | 0.428 | -0.021 | 0.413 | 0.803 | 0.35 | 1 |
| Rooms | -0.956 | -0.103 | -0.097 | 0.104 | 0.93 | 0.95 |
| Beds | -0.97 | -0.084 | 0.12 | 0.057 | 0.96 | 0.97 |
| Pop | -0.933 | -0.036 | 0.118 | 0.074 | 0.89 | 0.89 |
| Houses | -0.972 | -0.088 | 0.112 | 0.087 | 0.97 | 0.97 |
| Lat | 0.145 | -0.97 | -0.012 | -0.089 | 0.96 | 0.97 |
| Long | -0.15 | 0.969 | 0.057 | -0.063 | 0.96 | 0.97 |
The Eigenvalue Criterion recommended 3 components, but did not absolutely rule out a 4-component solution. Also, for small numbers of variables, this criterion can underestimate the best number of components to extract.
The Proportion of Variance Explained Criterion stated that we needed to use 4 components if we wanted to account for >90% of the variability.
The Scree Plot Criterion said not to exceed 4 components.
The Minimum Communality Criterion stated that, if we wanted to keep housing median age in the analysis, we had to extract 4 components.
Conclusion: PC4 is included.
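The communality computation behind the Minimum Communality Criterion can be sketched directly from the loadings: a variable's communality is the sum of its squared loadings on the retained components. A minimal, self-contained sketch (the built-in `USArrests` data stands in for `zw`; the logic is the same):

```r
# Sketch: communalities from PCA loadings (USArrests stands in for zw)
zz  <- scale(USArrests)                 # standardize, as was done for zw
pca <- prcomp(zz)
# Loadings = Eigenvectors * sqrt(Eigenvalues) i.e. * sdev, as derived above
ld  <- t(t(pca$rotation) * pca$sdev)
# Communality with k components retained = rowSums of squared loadings
k <- 2
round(rowSums(ld[, 1:k, drop = FALSE]^2), 2)
# Retaining ALL components gives communality exactly 1 for every variable
stopifnot(isTRUE(all.equal(unname(rowSums(ld^2)), rep(1, ncol(zz)))))
```

If a variable's communality under the retained set falls below a chosen floor (e.g. 0.5), the criterion argues for retaining one more component, which is how housing median age forced the 4-component decision above.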
Test Dataset
Factor Analysis (FA) is related to PCA but the two methods have different goals.
Principal components seek to identify orthogonal linear combinations of the variables, to be used either for descriptive purposes or to substitute a smaller number of uncorrelated components for the original variables.
In contrast, factor analysis represents a model for the data, and as such is more elaborate.
The factor analysis model hypothesizes that the response vector \({\{X_1, X_2, \ldots, X_m\}}\) can be modeled as linear combinations of a smaller set of \({k}\) unobserved, “latent” random variables \({\{F_1, F_2, \ldots, F_k\}}\) called common factors, along with an error term \(\mathbf{\epsilon} = {\{\epsilon_1, \epsilon_2, \ldots, \epsilon_m\}}\). Specifically, the factor analysis model is :
\[\underset{m \times 1}{\mathbf{X - \mu}} = \underset{m \times k}{\mathbf{L}} \, \underset{k \times 1}{\mathbf{F}} + \underset{m \times 1}{\mathbf{\epsilon}} \tag{42.1}\]
Where \(\underset{m \times 1}{\mathbf{X - \mu}}\) is the response vector, centered by the mean vector, \(\underset{m \times k}{\mathbf{L}}\) is the matrix of factor loadings, with \(l_{ij}\) representing the factor loading of the \(i^{\text{th}}\) variable on the \(j^{\text{th}}\) factor, \(\underset{k \times 1}{\mathbf{F}}\) represents the vector of unobservable common factors, and \(\underset{m \times 1}{\mathbf{\epsilon}}\) represents the error vector.
The factor analysis model differs from other models, such as the linear regression model, in that the predictor variables \({\{F_1, F_2, \ldots, F_k\}}\) are unobservable. Because so many terms are unobserved, further assumptions must be made before we may uncover the factors from the observed responses alone.
These assumptions are that \(E(\mathbf{F}) = \mathbf{0}, \text{Cov}(\mathbf{F}) = \mathbf{I}, E(\mathbf{\epsilon}) = \mathbf{0}, \text{Cov}(\mathbf{\epsilon})\) is a diagonal matrix.
Unfortunately, the factor solutions provided by factor analysis are invariant to orthogonal transformations. Hence, the factors uncovered by the model are in essence nonunique, without further constraints. This nonuniqueness provides the motivation for factor rotation.
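The invariance follows directly from the model in (42.1): for any orthogonal \({k \times k}\) matrix \(\mathbf{T}\) (so \(\mathbf{T}\mathbf{T}^{\top} = \mathbf{I}\)),

\[\mathbf{X - \mu} = \mathbf{L}\mathbf{F} + \mathbf{\epsilon} = (\mathbf{L}\mathbf{T})(\mathbf{T}^{\top}\mathbf{F}) + \mathbf{\epsilon}\]

and the transformed factors \(\mathbf{T}^{\top}\mathbf{F}\) still satisfy \(E(\mathbf{T}^{\top}\mathbf{F}) = \mathbf{0}\) and \(\text{Cov}(\mathbf{T}^{\top}\mathbf{F}) = \mathbf{T}^{\top}\mathbf{I}\mathbf{T} = \mathbf{I}\). Hence \(\mathbf{L}^{*} = \mathbf{L}\mathbf{T}\) is an equally valid loading matrix, and rotation simply chooses the \(\mathbf{T}\) that makes the loadings easiest to interpret.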
Please import the "C34-adult.csv".
# #Merge Tibbles with ID Names in Column
# #NA Introduced by changing Question Mark to NA
aa <- bind_rows(Train = tbl_aa, Test = tbl_bb, .id = 'source')
#
bb <- aa %>%
select(-c(fnlwgt, education, capital_gain, capital_loss)) %>%
mutate(Income = ifelse(Income == "<=50K" | Income == "<=50K.", "0", "1")) %>%
mutate(across(c(workclass, occupation, native_country), ~na_if(., "?"))) %>%
mutate(native_country = ifelse(str_detect(native_country,
paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>%
mutate(across(where(is.character), ~ factor(., levels = unique(.)))) %>%
mutate(marital_status =
fct_collapse(marital_status,
Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse")))
#
xxC34Adult <- bb
f_setRDS(xxC34Adult)
# #Levels of each Factor Variable
#lapply(aa[, sapply(aa, is.factor)], levels)
#levels(aa$marital_status)
summary(aa$marital_status)
ii <- aa %>% select(marital_status) %>%
mutate(marital_status =
fct_collapse(marital_status,
Married = c("Married-civ-spouse", "Married-spouse-absent", "Married-AF-spouse")))
#summary(ii$marital_status)
# #Check Numeric Columns by summary()
if(TRUE) aa %>% select(where(is.numeric)) %>% summary()
if(TRUE) sort(unique(aa$age))
if(TRUE) length(sort(unique(aa$age)))
if(TRUE) aa %>% count(age) %>% mutate(PROP = n/sum(n)) #%>% arrange(desc(n)) %>% head(10)
#
# #Find Missing Numbers in a Sequence of Numbers
#ii <- unique(aa$age)
ii <- unique(aa$hours_per_week)
jj <- min(ii):max(ii)
jj[!jj %in% ii]
#
# #Equivalent
setdiff(jj, ii)
# #To Search For Question Mark in All Factor Columns, Question Mark needs to be escaped
# #The Backslash used for escaping itself needs to be escaped using Backslash
aa %>% rowwise() %>%
mutate(find_me = any(str_detect(c_across(where(is.factor)),
regex("\\?", ignore_case = TRUE)), na.rm = TRUE)) %>%
filter(find_me)
#
# #To Get the Column Names containing a String i.e. '?'
which(vapply(aa, function(x) any(stri_detect(x, regex = "\\?", max_count = 1)), logical(1)))
# #Search and Replace for Multiple Partial Matches
# #NOTE: "|" should be used to collapse NOT " | "
# #NOTE: Question Marks Replaced as Other
aa %>% mutate(native_country = ifelse(str_detect(native_country,
paste0(c("United-", "Outlying-US"), collapse = "|")), "USA", "Other")) %>%
  count(native_country)
# #Check Factor Columns by summary()
if(TRUE) aa %>% select(where(is.factor)) %>% summary()
ii <- factor(aa$native_country)
if(TRUE) levels(ii)
if(TRUE) nlevels(ii)
aa %>% count(native_country) #%>% tail(10)
Note that the correlations, although statistically significant in several cases, are overall much weaker than the correlations from the ‘houses’ data set. A weaker correlation structure should pose more of a challenge for the dimension-reduction method.
NOTE: While the Book created ‘Net Capital,’ I have skipped that because the Capital Gain and Capital Loss Columns have an extremely high number of zeroes. Further, an ID column ‘fnlwgt’ was also dropped.
| | age | edu | hours |
|---|---|---|---|
| age | |||
| edu | 0.0363 | ||
| hours | 0.0699 | 0.15 |
Factor analysis requires a certain level of correlation in order to function appropriately. The following tests have been developed to ascertain whether there exists sufficiently high correlation to perform factor analysis.
Note, however, that statistical tests in the context of huge data sets can be misleading. With huge sample sizes, even the smallest effect sizes become statistically significant. This is why data mining methods rely on cross-validation methodologies, not statistical inference.
The proportion of variability within the standardized predictor variables which is shared in common, and therefore might be caused by underlying factors, is measured by the Kaiser–Meyer–Olkin (KMO) Measure of Sampling Adequacy.
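In its standard form (not given in the text), the overall KMO statistic is

\[\text{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}\]

where \(r_{ij}\) are the simple correlations and \(a_{ij}\) the partial (anti-image) correlations. Values close to 1 indicate that the correlations are largely explainable by common factors; values below 0.5 are generally considered unacceptable for factor analysis.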
Bartlett Test of Sphericity tests the null hypothesis that the correlation matrix is an identity matrix, that is, that the variables are really uncorrelated.
The statistic reported is the p-value, so that very small values would indicate evidence against the null hypothesis, that is, the variables really are correlated.
For p-values much larger than 0.10, there is insufficient evidence that the variables are correlated, and so factor analysis may not be suitable.
It compares an observed correlation matrix to the identity matrix.
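In its standard form, the Bartlett test statistic is approximately chi-square distributed:

\[\chi^{2} = -\left(n - 1 - \frac{2m + 5}{6}\right)\ln\left|\mathbf{R}\right|, \qquad df = \frac{m(m-1)}{2}\]

where \(n\) is the number of observations, \(m\) the number of variables, and \(\left|\mathbf{R}\right|\) the determinant of the correlation matrix. An identity correlation matrix has determinant 1, giving \(\chi^{2} = 0\).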
The KMO statistic has a value of 0.52, which is not less than 0.5, meaning that this test does not find the level of correlation to be too low for factor analysis.
The p-value for Bartlett Test of Sphericity rounds to zero, so that the null hypothesis that no correlation exists among the variables is rejected. We therefore proceed with the factor analysis.
# #KMO Test: Measure of Sampling Adequacy (MSA)
KMO(cor(adl_zw))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(adl_zw))
## Overall MSA = 0.52
## MSA for each item =
## age edu hours
## 0.56 0.51 0.51
# #Bartlett Test of Sphericity
bartsph <- cortest.bartlett(cor(adl_zw), n = nrow(adl_zw))
if(bartsph$p.value < 0.05) {
cat("Null Rejected. Variables are Correlated. Dimension Reduction can be performed.\n")
} else {
cat("Failed to reject the Null. Uncorrelated Variables. No benefit in Dimension Reduction.\n")
}
## Null Rejected. Variables are Correlated. Dimension Reduction can be performed.
To allow us to view the results using a scatter plot, we decide a priori to extract only two factors.
The following factor analysis is performed using the principal axis factoring option.
fa() produced these warnings:
Warning: The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.
Warning: An ultra-Heywood case was detected. Examine the results carefully.
NOTE: Add ‘SMC = FALSE’ in fa().
adl_fa <- fa(adl_zw, nfactors = 2, #ncol(adl_zw)
fm = "pa", rotate = "none", SMC = FALSE)
# #Loadings
adl_fa$loadings
##
## Loadings:
## PA1 PA2
## age 0.983
## edu 0.633
## hours 0.225
##
## PA1 PA2
## SS loadings 0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
# #Values
adl_fa$values
## [1] 0.9817285949 0.4567784008 0.0007492974
#
# #Communalities
adl_fa$communalities
## [1] 1.438507
# #Warnings
# #The estimated weights for the factor scores are probably incorrect.
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected. Examine the results carefully
fa(adl_zw, nfactors = 2, #ncol(adl_zw)
fm = "pa", rotate = "none", SMC = FALSE)
## Factor Analysis using method = pa
## Call: fa(r = adl_zw, nfactors = 2, rotate = "none", SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## age 0.98 -0.08 0.972 0.028 1.0
## edu 0.09 0.63 0.408 0.592 1.0
## hours 0.09 0.23 0.058 0.942 1.3
##
## PA1 PA2
## SS loadings 0.98 0.46
## Proportion Var 0.33 0.15
## Cumulative Var 0.33 0.48
## Proportion Explained 0.68 0.32
## Cumulative Proportion 0.68 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 3 and the objective function was 0.03 with Chi Square of 706.35
## The degrees of freedom for the model are -2 and the objective function was 0
##
## The root mean square of the residuals (RMSR) is 0
## The df corrected root mean square of the residuals is NA
##
## The harmonic number of observations is 25000 with the empirical chi square 0 with prob < NA
## The total number of observations was 25000 with Likelihood Chi Square = 0 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.98 0.66
## Multiple R square of scores with factors 0.97 0.43
## Minimum correlation of possible factor scores 0.94 -0.14
# #Warnings
# #The estimated weights for the factor scores are probably incorrect.
# #Try a different factor score estimation method.
# #An ultra-Heywood case was detected. Examine the results carefully
fa(cor(adl_zw), nfactors = 2, #ncol(adl_zw)
fm = "pa", rotate = "none", n.obs = 25000, SMC = FALSE)
## Factor Analysis using method = pa
## Call: fa(r = cor(adl_zw), nfactors = 2, n.obs = 25000, rotate = "none",
## SMC = FALSE, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## age 0.98 -0.08 0.972 0.028 1.0
## edu 0.09 0.63 0.408 0.592 1.0
## hours 0.09 0.23 0.058 0.942 1.3
##
## PA1 PA2
## SS loadings 0.98 0.46
## Proportion Var 0.33 0.15
## Cumulative Var 0.33 0.48
## Proportion Explained 0.68 0.32
## Cumulative Proportion 0.68 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 3 and the objective function was 0.03 with Chi Square of 706.35
## The degrees of freedom for the model are -2 and the objective function was 0
##
## The root mean square of the residuals (RMSR) is 0
## The df corrected root mean square of the residuals is NA
##
## The harmonic number of observations is 25000 with the empirical chi square 0 with prob < NA
## The total number of observations was 25000 with Likelihood Chi Square = 0 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.98 0.66
## Multiple R square of scores with factors 0.97 0.43
## Minimum correlation of possible factor scores 0.94 -0.14
# #No Warning
fa(adl_zw, nfactors = ncol(adl_zw), fm = "minres", rotate = "none")
## Factor Analysis using method = minres
## Call: fa(r = adl_zw, nfactors = ncol(adl_zw), rotate = "none", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR3 h2 u2 com
## age 0.15 0.14 0 0.042 0.96 2.0
## edu 0.36 -0.12 0 0.144 0.86 1.2
## hours 0.43 0.05 0 0.187 0.81 1.0
##
## MR1 MR2 MR3
## SS loadings 0.34 0.04 0.00
## Proportion Var 0.11 0.01 0.00
## Cumulative Var 0.11 0.12 0.12
## Proportion Explained 0.90 0.10 0.00
## Cumulative Proportion 0.90 1.00 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 3 and the objective function was 0.03 with Chi Square of 706.35
## The degrees of freedom for the model are -3 and the objective function was 0
##
## The root mean square of the residuals (RMSR) is 0
## The df corrected root mean square of the residuals is NA
##
## The harmonic number of observations is 25000 with the empirical chi square 0 with prob < NA
## The total number of observations was 25000 with Likelihood Chi Square = 0 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.004
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR1 MR2 MR3
## Correlation of (regression) scores with factors 0.54 0.20 0
## Multiple R square of scores with factors 0.29 0.04 0
## Minimum correlation of possible factor scores -0.43 -0.92 -1
To assist in the interpretation of the factors, factor rotation may be performed. Factor rotation corresponds to a transformation (usually orthogonal) of the coordinate axes, leading to a different set of factor loadings.
The sharpest focus occurs when each variable has high factor loadings on a single factor, with low-to-moderate loadings on the other factors.
“ForLater” Figure 4.6 Page 115 - Biplot
No significant difference was observed with the different rotations, unlike in the BOOK.
# #A huge number of rotations is available, including "none", "varimax", "quartimax", "equamax" ...
# #No significant difference observed
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "none", SMC = FALSE)$loadings
##
## Loadings:
## PA1 PA2
## age 0.983
## edu 0.633
## hours 0.225
##
## PA1 PA2
## SS loadings 0.982 0.457
## Proportion Var 0.327 0.152
## Cumulative Var 0.327 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "varimax", SMC = FALSE)$loadings
##
## Loadings:
## PA1 PA2
## age 0.982
## edu 0.638
## hours 0.236
##
## PA1 PA2
## SS loadings 0.968 0.471
## Proportion Var 0.323 0.157
## Cumulative Var 0.323 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "quartimax", SMC = FALSE)$loadings
##
## Loadings:
## PA1 PA2
## age 0.986
## edu 0.638
## hours 0.232
##
## PA1 PA2
## SS loadings 0.978 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480
#
fa(adl_zw, nfactors = 2, fm = "pa", rotate = "equamax", SMC = FALSE)$loadings
##
## Loadings:
## PA1 PA2
## age 0.986
## edu 0.638
## hours 0.232
##
## PA1 PA2
## SS loadings 0.977 0.461
## Proportion Var 0.326 0.154
## Cumulative Var 0.326 0.480
“Univariate Statistical Analysis (335)” was a summary view of Hypothesis Testing.
“Multivariate Statistics (336)” was a summary view of ANOVA, Goodness of Fit etc.
42.3 Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
For example, suppose we report that “only” 28.4% of customers adopting our International Plan will churn. That does not sound too bad, until we recall that, among all of our customers, the overall churn rate is only 14.49%. This overall churn rate may be considered our baseline, against which any further results can be calibrated. Thus, belonging to the International Plan actually nearly doubles the churn rate, which is clearly not good.
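The baseline comparison is simple arithmetic; a minimal sketch in R, using the two churn rates quoted above:

```r
# Compare segment churn against the overall (baseline) churn rate
segment_churn  <- 0.284    # churn among International Plan adopters
baseline_churn <- 0.1449   # overall churn rate (the baseline)
lift <- segment_churn / baseline_churn
round(lift, 2)   # ~1.96: the International Plan nearly doubles the churn rate
```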
For example, suppose the algorithm your analytics company currently uses succeeds in identifying 90% of all fraudulent online transactions. Then, your company will probably expect your new data mining model to outperform this 90% baseline.
Refer k-means Algorithm
Refer Pseudo F-Statistic
One potential problem in applying the k-means algorithm is: who decides how many clusters to search for, i.e. who decides k?
What if some attributes are more relevant than others to the problem formulation?
“ForLater” - Nothing groundbreaking there for now.
“ForLater” - NOTE that the Pseudo F-Statistic prefers k=3, in contrast to the Silhouette, which preferred k=2.
The apriori algorithm takes advantage of the a priori property to shrink the search space.
“ForLater” - Apply a priori on Adult Dataset
“ForLater” - “Association Rules are easy to do badly” - Example: Adult Dataset
Not all association rules are equally useful; lift can be used to quantify a rule's usefulness.
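As a sketch of how lift is computed (the five toy transactions here are hypothetical, and base R is used in place of the arules package, which computes these measures for mined rule sets):

```r
# Toy transactions (hypothetical) to illustrate support, confidence, and lift
trans <- list(c("bread", "butter"), "bread", c("butter", "milk"),
              c("bread", "butter", "milk"), "milk")
supp <- function(items) mean(sapply(trans, function(t) all(items %in% t)))
supp_A  <- supp("bread")                # support of the antecedent: 0.6
supp_B  <- supp("butter")               # support of the consequent: 0.6
supp_AB <- supp(c("bread", "butter"))   # joint support: 0.4
conf <- supp_AB / supp_A                # confidence of {bread} -> {butter}
lift <- conf / supp_B                   # lift > 1: better than chance
round(c(confidence = conf, lift = lift), 3)
```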
# #SUMMARISED Packages and Objects (BOOK CHECK)
f_()
## [1] "color_base, color_sort, color_uniq, fruit_base, fruit_sort, fruit_uniq"
#
difftime(Sys.time(), k_start)
## Time difference of 1.320141 mins
1.1: Vectors
Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
1.2: Components
Members of a vector are called components.
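A minimal illustration of both definitions:

```r
v <- c(2, 4, 6)    # a numeric vector with three components
length(v)          # number of components: 3
v[2]               # the second component: 4
typeof(v)          # all components share one basic type: "double"
```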
1.3: Packages
Packages are the fundamental units of reproducible R code.
2.1: R-Markdown
R Markdown is a file format for making dynamic documents with R.
2.2: NA
NA is a logical constant of length 1 which contains a missing value indicator.
2.3: Factors
Factors are the data objects which are used to categorize the data and store it as levels.
2.4: Lists
Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.
2.5: DataFrame
Data Frames are lists with the restriction that all elements of a data frame are of equal length.
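A quick check of this list-with-equal-lengths view:

```r
df <- data.frame(id = c(1, 2, 3), grade = c("a", "b", "c"))
is.list(df)    # TRUE: a data frame is a list ...
lengths(df)    # ... whose elements (columns) all have the same length: 3 3
dim(df)        # 3 rows, 2 columns
```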
7.1: H-Variances
\(\text{\{Variances\}} {H_0} : {\sigma}_1 = {\sigma}_2 = \dots = {\sigma}_k \iff {H_a}: \text{At least two variances differ.}\)
8.1: Imputation
Imputation is the process of replacing missing data with substituted values. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
15.1: Linear-Regression
Linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables.
21.1: Data
Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
21.2: Elements
Elements are the entities on which data are collected. (Generally ROWS)
21.3: Variable
A variable is a characteristic of interest for the elements. (Generally COLUMNS)
21.4: Observation
The set of measurements obtained for a particular element is called an observation.
21.5: Statistics
Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
21.6: Scale-of-Measurement
The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
21.7: Nominal-Scale
When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.
21.8: Ordinal-Scale
The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.
21.9: Interval-Scale
The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.
21.10: Ratio-Scale
The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.
21.11: Categorical-Data
Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
21.12: Quantitative-Data
Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.
21.13: Discrete
Quantitative data that measure ‘how many’ are discrete.
21.14: Continuous
Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
21.15: Cross-Sectional-Data
Cross-sectional data are data collected at the same or approximately the same point in time.
21.16: Time-Series-Data
Time-series data are data collected over several time periods.
21.17: Observational-Study
In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
21.18: Experiment
The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.
21.19: Descriptive-Statistics
Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.
21.20: Population
A population is the set of all elements of interest in a particular study.
21.21: Sample
A sample is a subset of the population.
21.22: Parameter-vs-Statistic
The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
21.23: Census
The process of conducting a survey to collect data for the entire population is called a census.
21.24: Sample-Survey
The process of conducting a survey to collect data for a sample is called a sample survey.
21.25: Statistical-Inference
Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.
21.26: Analytics
Analytics is the scientific process of transforming data into insight for making better decisions.
21.27: Descriptive-Analytics
Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.
21.28: Predictive-Analytics-301
Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.
21.29: Prescriptive-Analytics
Prescriptive analytics is the set of analytical techniques that yield a best course of action.
21.30: Big-Data
Larger and more complex data sets are now often referred to as big data.
21.31: Data-Mining-301
Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.
22.1: Frequency-Distribution
A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.
22.2: Cross-Tab
A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.
23.1: Number
A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.
23.2: Prime
A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither prime nor composite; it is a ‘unit.’ Thus, by definition, negative integers and zero cannot be prime.
23.3: Parity-Odd-Even
Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainder, and odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
23.4: Positive-Negative
An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
23.5: Mersenne-Primes
Mersenne primes are prime numbers of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)
23.6: Measures-of-Location
Measures of location are numerical summaries that indicate where on a number line a certain characteristic of the variable lies. Examples of the measures of location are percentiles and quantiles.
23.7: Measures-of-Center
The measures of center are a special case of measures of location. These estimate where the center of a particular variable lies. Most common are Mean, Median, and Mode.
23.8: Mean
Given a data set \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({{x}_1, {x}_2, \ldots, {x}_n}\) divided by the count \({n}\).
23.9: Median
Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
23.10: Geometric-Mean
The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
23.11: Mode
The mode is the value that occurs with greatest frequency.
23.12: Percentile
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
23.13: Measures-of-Spread
Measures of spread (or measures of variability) describe how spread out the data values are. Examples are range, SD, mean absolute deviation, and IQR.
23.14: Variance
The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
23.15: Standard-Deviation
The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
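The four measures above on a small hypothetical sample:

```r
x <- c(4, 8, 6, 5, 2)
mean(x)      # (4 + 8 + 6 + 5 + 2) / 5 = 5
median(x)    # middle value of the sorted data: 5
var(x)       # sample variance: sum of squared deviations / (n - 1) = 5
sd(x)        # positive square root of the variance: sqrt(5)
```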
23.16: Skewness
Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
23.17: Tails
A tail refers to the tapering sides at either end of a distribution curve.
23.18: Kurtosis
Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal{N}}_{(\mu, \, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).
23.19: TheSample
A sample of \({n}\) observations given by \({X = \{{x}_1, {x}_2, \ldots, {x}_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
23.20: z-Scores
The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
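A sketch of standardization on hypothetical data:

```r
x <- c(4, 8, 6, 5, 2)
z <- (x - mean(x)) / sd(x)   # standardized values (z-scores)
round(z, 3)                  # each value's distance from the mean in SD units
mean(z); sd(z)               # standardized data have mean 0 and sd 1
```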
23.21: t-statistic
Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.
23.22: Chebyshev-Theorem
Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
23.23: Empirical-Rule
Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These probabilities are Pr(x) 68.27%, 95.45%, and 99.73%.
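The three percentages follow directly from the normal CDF; in R:

```r
# Area within k standard deviations of the mean for a normal distribution
k <- 1:3
round(100 * (pnorm(k) - pnorm(-k)), 2)   # 68.27 95.45 99.73
```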
23.24: Outliers
An outlier is a data point that differs significantly from other observations; unusually large or unusually small values are often outliers.
23.25: Covariance
Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
23.26: Correlation-Coefficient
Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
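Both measures on a small hypothetical pair of variables (note that cov() and cor() in R use the sample formulas, with denominator n - 1):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cov(x, y)    # positive: x and y tend to move together
cor(x, y)    # bounded in [-1, +1]; here a fairly strong positive association
```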
24.1: Probability
Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
24.2: Random-Experiment
A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
24.3: Sample-Space
The sample space for a random experiment is the set of all experimental outcomes.
24.4: Counting-Rule
Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)
24.5: Tree-Diagram
A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
24.6: Factorial
The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)
24.7: Combinations
Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)
24.8: Permutations
Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
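Both counts are available directly in base R, with \(P_k^N = C_k^N \cdot k!\):

```r
factorial(0)                   # 0! = 1 by definition
choose(5, 2)                   # C(5, 2) = 10 combinations (order ignored)
choose(5, 2) * factorial(2)    # P(5, 2) = 20 permutations (order matters)
```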
24.9: Event
An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \(S\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
24.10: Complement
Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
24.11: Union
Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
24.12: Intersection
Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
24.13: Mutually-Exclusive
Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\) and \(P(A \cap B) = 0\)
24.14: Conditional-Probability
Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
24.15: Events-Independent
Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
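A worked check of both definitions on a finite sample space (two fair dice):

```r
# Two fair dice: A = "first die shows 6", B = "the sum is 7"
rolls <- expand.grid(d1 = 1:6, d2 = 1:6)   # the 36-point sample space
A <- rolls$d1 == 6
B <- rolls$d1 + rolls$d2 == 7
p_B_given_A <- sum(A & B) / sum(A)   # P(B|A) = P(A & B) / P(A) = 1/6
p_B <- mean(B)                       # P(B) = 6/36 = 1/6
p_B_given_A == p_B                   # TRUE: A and B are independent
```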
25.1: Random-Variable
A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
25.2: Discrete-Random-Variable
A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. This includes coded factors, e.g. Male as 0, Female as 1, etc.
25.3: Continuous-Random-Variable
A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.
25.4: Probability-Distribution
The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
25.5: Probability-Function
For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
25.6: Expected-Value-Discrete
The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
25.7: Variance-Discrete
The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
25.8: Bivariate
A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
26.1: Uniform-Probability-Distribution
Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
26.2: Probability-Density-Function
The probability that the continuous random variable \({x}\) takes a value between \([a, b]\) is given by the area under the graph of probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1, however its integral must be equal to 1.
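A numeric check of both properties, using the uniform density on [0, 2] as a sketch:

```r
# Uniform density on [0, 2]: f(x) = 1/2 on the interval, 0 elsewhere
f <- function(x) dunif(x, min = 0, max = 2)
integrate(f, lower = 0, upper = 2)$value     # total area under f(x) is 1
integrate(f, lower = 0.5, upper = 1)$value   # P(0.5 <= x <= 1) = 0.25
dunif(0.1, min = 0, max = 0.5)               # = 2: a density can exceed 1
```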
26.3: Normal-Distribution
A normal distribution (\({\mathcal{N}}_{({\mu}, \, {\sigma}^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
26.4: Standard-Normal
A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal{z}}_{({\mu} = 0, \, {\sigma} = 1)}\)
27.1: Sampled-Population
The sampled population is the population from which the sample is drawn.
27.2: Frame
Frame is a list of the elements that the sample will be selected from.
27.3: Target-Population
The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but it may also differ.
27.4: SRS
A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)
27.5: Sampling-without-Replacement
Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
27.6: Sampling-with-Replacement
Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
27.7: Random-Sample
A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied: each element selected comes from the same population, and each element is selected independently. The second condition prevents selection bias.
27.8: Proportion
A population proportion \({P}\), is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.
27.9: Point-Estimation
To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
27.10: Point-Estimator
A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), and \(\rho_{xy}\) (correlation)
27.11: Point-Estimate
The numerical value obtained for the sample statistic is called the point estimate. ‘Estimate’ is used for a sample value only; the corresponding population value is a parameter. An estimate is a value, while an estimator is a function.
27.12: Sampling-Distribution
The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
27.13: Standard-Error
In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\). It is the indicator of ‘Sampling Fluctuation.’
27.14: Sampling-Error
A sampling error is the difference between a population parameter and a sample statistic.
27.15: Central-Limit-Theorem
Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
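A small simulation sketch of the theorem, drawing sample means from a heavily skewed population (the seed, sample size, and replication count are arbitrary):

```r
set.seed(42)
# Population: exponential (heavily skewed), with mean 1 and sd 1
xbar <- replicate(5000, mean(rexp(30)))   # 5000 sample means, n = 30
mean(xbar)   # close to the population mean, 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.183
```

A histogram of xbar would look approximately normal even though the population itself is far from normal.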
28.1: Interval-Estimate
Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error (MOE), to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{MOE}_{\gamma}\)
28.2: Confidence-Interval
Confidence interval is another name for an interval estimate. Normally it is given as \(({\gamma} = 1 - {\alpha})\). Ex: 95% confidence interval
28.3: Confidence-Coefficient
The confidence level expressed as a decimal value is the confidence coefficient \(({\gamma} = 1 - {\alpha})\). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
28.4: t-distribution
When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.
28.5: Degrees-of-Freedom
The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
29.1: Hypothesis-Testing
Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.
29.2: Hypothesis-Null
Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
29.3: Hypothesis-Alternative
Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.
29.4: Hypothesis-1T-Lower-Tail
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu} \geq {\mu}_0 \iff {H_a}: {\mu} < {\mu}_0\)
29.5: Hypothesis-1T-Upper-Tail
\(\text{\{Right or Upper\} } {H_0} : {\mu} \leq {\mu}_0 \iff {H_a}: {\mu} > {\mu}_0\)
29.6: Hypothesis-2T-Two-Tail
\(\text{\{Two Tail Test \} } \thinspace {H_0} :{\mu} = {\mu}_0 \iff {H_a}: {\mu} \neq {\mu}_0\)
29.7: Error-Type-I
The error of rejecting \({H_0}\) when it is true, is Type I error \(({\alpha})\).
29.8: Error-Type-II
The error of accepting \({H_0}\) when it is false, is Type II error \(({\beta})\).
29.9: Level-of-Significance
The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
29.10: Significance-Tests
Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
29.11: Test-Statistic
Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
29.12: Tailed-Test
A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
29.13: One-Tailed-Test
One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.
29.14: 1s-known-sd
If \({\sigma}\) is known, the standard normal random variable \({z}\) is used as test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. Refer equation (29.1) \(\to z = \frac{\overline{x} - {\mu}_0}{{\sigma}_{\overline{x}}} = \frac{\overline{x} - {\mu}_0}{{\sigma}/\sqrt{n}}\)
29.15: Approach-p-value
The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
29.16: p-value
A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
29.17: Approach-Critical-Value
The critical value approach requires that we first determine a value for the test statistic called the critical value.
29.18: Critical-Value
Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
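These critical values come straight from the normal quantile function:

```r
round(qnorm(0.95), 3)    # upper-tail critical value for alpha = 0.05: 1.645
round(qnorm(0.975), 3)   # two-tail critical value for alpha = 0.05: 1.96
```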
29.19: Acceptance-Region
An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic is in the confidence interval, then we accept the null hypothesis and reject the alternative hypothesis.
29.20: Margin-Error
The margin of error tells how far the true population mean might be from the sample mean. It is given by \(Z\frac{{\sigma}}{\sqrt{n}}\)
29.21: Rejection-Region
A rejection region (critical region) is a set of values for the test statistic for which the null hypothesis is rejected, i.e. if the observed test statistic is in the critical region, then we reject the null hypothesis and accept the alternative hypothesis.
29.22: Two-Tailed-Test
Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.
29.23: Approach-p-value-Steps
p-value Approach: Form Hypothesis | Specify \({\alpha}\) | Calculate test statistic | Calculate p-value | Compare p-value with \({\alpha}\) | Interpret
29.24: 1s-unknown-sd
If \({\sigma}\) is unknown, the sampling distribution of the test statistic follows the t distribution with \((n - 1)\) degrees of freedom. Refer equation (29.3) \(\to t = \frac{{\overline{x}} - {\mu}_0}{{s}/\sqrt{n}}\)
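Equation (29.3) computed by hand and via t.test(), on a small hypothetical sample:

```r
x <- c(4, 8, 6, 5, 2)    # a small sample (hypothetical data)
mu0 <- 4                 # hypothesized population mean
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_stat                                  # 1, on (n - 1) = 4 degrees of freedom
unname(t.test(x, mu = mu0)$statistic)   # t.test() computes the same value
```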
29.25: H-1s-p-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p} \geq {p}_0 \iff {H_a}: {p} < {p}_0\)
29.26: H-1s-p-Upper
\(\text{\{Right or Upper\} } {H_0} : {p} \leq {p}_0 \iff {H_a}: {p} > {p}_0\)
29.27: H-1s-p-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p} = {p}_0 \iff {H_a}: {p} \neq {p}_0\)
29.28: Power
The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
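Base R's `power.t.test()` computes \(1 - \beta\) directly; the sample size, effect size, and standard deviation below are hypothetical.

```r
# Power = 1 - beta of a one-sample t test (hypothetical inputs)
pw <- power.t.test(n = 30, delta = 0.5, sd = 1,
                   sig.level = 0.05, type = "one.sample")
pw$power  # probability of correctly rejecting H0 when mu = mu0 + 0.5
```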
29.29: Power-Curve
A power curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.
30.1: Independent-Simple-Random-Samples
Let \({\mathcal{N}}_{({\mu}_1, \, {\sigma}_1)}\) and \({\mathcal{N}}_{({\mu}_2, \, {\sigma}_2)}\) be the two populations. To make an inference about the difference between the means \(({\mu}_1 - {\mu}_2)\), we select a simple random sample of \({n}_1\) units from population 1 and a second simple random sample of \({n}_2\) units from population 2. The two samples, taken separately and independently, are referred to as independent simple random samples.
30.2: H-2s-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\mu}_1 - {\mu}_2 \geq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 < {D_0}\)
30.3: H-2s-Upper
\(\text{\{Right or Upper\} } {H_0} : {\mu}_1 - {\mu}_2 \leq {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 > {D_0}\)
30.4: H-2s-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_1 - {\mu}_2 = {D_0} \iff {H_a}: {\mu}_1 - {\mu}_2 \neq {D_0}\)
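The three forms above can be tested in R with a two-sample `t.test()`; the simulated data below are illustrative only (Welch's unequal-variance form is R's default).

```r
# Two-sample t test of H0: mu1 - mu2 = 0 (simulated data)
set.seed(5)
x1  <- rnorm(30, mean = 10)
x2  <- rnorm(35, mean = 9.5)
res <- t.test(x1, x2, mu = 0, alternative = "two.sided")  # Welch by default
```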
30.5: Shapiro-Wilk-Test
The Shapiro-Wilk test is a test of normality. It tests the null hypothesis that a sample came from a normally distributed population. \(P_{\text{shapiro}} > ({\alpha} = 0.05) \to \text{Data is Normal}\). Avoid using samples with more than 5000 observations.
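A minimal sketch with simulated data: `shapiro.test()` in base R enforces the sample-size limit noted above (it accepts between 3 and 5000 observations, matching error 30.1 below).

```r
# Shapiro-Wilk normality test on simulated normal data
set.seed(1)
x <- rnorm(100)
res <- shapiro.test(x)   # H0: sample comes from a normal population
# A large p-value (> 0.05) gives no evidence against normality
```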
30.6: Independent-Sample-Design-Example
Independent sample design: A simple random sample of workers is selected and each worker in the sample uses method 1. A second independent simple random sample of workers is selected and each worker in this sample uses method 2.
30.7: Matched-Sample-Design-Example
Matched sample design: One simple random sample of workers is selected. Each worker first uses one method and then uses the other method. The order of the two methods is assigned randomly to the workers, with some workers performing method 1 first and others performing method 2 first. Each worker provides a pair of data values, one value for method 1 and another value for method 2.
30.8: Hypo-Paired-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\mu}_d = 0 \iff {H_a}: {\mu}_d \neq 0\)
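For a matched-sample design, R tests \({H_0}: {\mu}_d = 0\) with a paired `t.test()`; the task times below are hypothetical.

```r
# Matched-sample design: paired t test of H0: mu_d = 0 (hypothetical times)
method1 <- c(6.0, 5.0, 7.0, 6.2, 6.0, 6.4)
method2 <- c(5.4, 5.2, 6.5, 5.9, 6.0, 5.8)
res <- t.test(method1, method2, paired = TRUE)  # df = n - 1 = 5
```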
30.9: H-2s-p-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {p}_1 - {p}_2 \geq 0 \iff {H_a}: {p}_1 - {p}_2 < 0\)
30.10: H-2s-p-Upper
\(\text{\{Right or Upper\} } {H_0} : {p}_1 - {p}_2 \leq 0 \iff {H_a}: {p}_1 - {p}_2 > 0\)
30.11: H-2s-p-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {p}_1 - {p}_2 = 0 \iff {H_a}: {p}_1 - {p}_2 \neq 0\)
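The two-proportion forms above can be tested in R with `prop.test()`; the success counts and sample sizes below are hypothetical.

```r
# Two-sample test of H0: p1 - p2 = 0 (hypothetical counts)
res <- prop.test(x = c(45, 30), n = c(200, 180), correct = FALSE)
```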
31.1: Distribution-Chi-Square
Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.
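A sketch of the resulting one-sample variance test, using simulated data and a hypothetical \({\sigma}_0^2\):

```r
# Chi-square statistic for a one-sample variance test (simulated data)
set.seed(7)
x <- rnorm(25, mean = 50, sd = 4)
n <- length(x)
sigma0_sq  <- 16                                  # hypothesized variance
chisq_stat <- (n - 1) * var(x) / sigma0_sq
p_upper <- pchisq(chisq_stat, df = n - 1, lower.tail = FALSE)  # upper-tail p
```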
31.2: H-1s-Var-Lower
\(\text{\{Left or Lower \} }\space\thinspace {H_0} : {\sigma}^2 \geq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 < {{\sigma}_0^2}\)
31.3: H-1s-Var-Upper
\(\text{\{Right or Upper\} } {H_0} : {\sigma}^2 \leq {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 > {{\sigma}_0^2}\)
31.4: H-1s-Var-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}^2 = {{\sigma}_0^2} \iff {H_a}: {\sigma}^2 \neq {{\sigma}_0^2}\)
31.5: Distribution-F
Whenever independent simple random samples of sizes \({n}_1\) and \({n}_2\) are selected from two normal populations with equal variances \(({\sigma}_1^2 = {\sigma}_2^2)\), the sampling distribution of \(\frac{{s}_1^2}{{s}_2^2}\) is an F distribution with \(({n}_1 - 1)\) degrees of freedom for the numerator and \(({n}_2 - 1)\) degrees of freedom for the denominator.
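In R, `var.test()` forms exactly this ratio of sample variances; the simulated data below are illustrative.

```r
# F test of H0: sigma1^2 = sigma2^2 (simulated data)
set.seed(2)
x <- rnorm(20, sd = 2)
y <- rnorm(25, sd = 2)
res <- var.test(x, y)   # F = s1^2 / s2^2 with (19, 24) degrees of freedom
```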
31.6: H-2s-Var-Lower
\(\text{\{Left or Lower \} }\space\thinspace \text{Do not do this: by convention the larger sample variance is placed in the numerator, so only the upper-tail and two-tail forms are used.}\)
31.7: H-2s-Var-Upper
\(\text{\{Right or Upper\} } {H_0} : {\sigma}_1^2 \leq {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 > {\sigma}_2^2\)
31.8: H-2s-Var-Two
\(\text{\{Two Tail Test \} } \thinspace {H_0} : {\sigma}_1^2 = {\sigma}_2^2 \iff {H_a}: {\sigma}_1^2 \neq {\sigma}_2^2\)
32.1: H-3p
\(\text{\{Equality of Population Proportions \}} {H_0} : {p}_1 = {p}_2 = \dots = {p}_k \iff {H_a}: \text{Not all population proportions are equal}\)
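This hypothesis can be tested in R by passing \(k\) success counts and sample sizes to `prop.test()`; the counts below are hypothetical.

```r
# Test of H0: p1 = p2 = p3 (hypothetical success counts)
res <- prop.test(x = c(40, 50, 35), n = c(150, 160, 140))
res$parameter   # chi-square df = k - 1 = 2
```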
33.1: Randomization
Randomization is the process of assigning the treatments to the experimental units at random.
33.2: H-ANOVA
\(\text{\{ANOVA\}} {H_0} : {\mu}_1 = {\mu}_2 = \dots = {\mu}_k \iff {H_a}: \text{Not all population means are equal}\)
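A minimal one-way ANOVA in R, using the built-in `chickwts` data (chick weights across six feed types):

```r
# One-way ANOVA: H0 that all feed-group mean weights are equal
fit <- aov(weight ~ feed, data = chickwts)
summary(fit)   # F statistic and p-value for the feed effect
```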
34.1: Regression-Analysis
Regression analysis can be used to develop an equation showing how two or more variables are related.
34.2: Variable-Dependent
The variable being predicted is called the dependent variable \(({y})\).
34.3: Variable-Independent
The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
34.4: Simple-Linear-Regression
The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
34.5: Regression-Model
The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, simple linear regression model is given by equation (34.1)
38.1: Parametric-Methods
Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
38.2: Distribution-free-Methods
Distribution-free methods are the Statistical methods that make no assumption about the probability distribution of the population.
38.3: Nonparametric-Methods
Nonparametric methods are the statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution free methods. Several of the methods can be applied with categorical as well as quantitative data.
39.1: Data-Mining-331
Data mining is the process of discovering useful patterns and trends in large data sets.
39.2: Predictive-Analytics-331
Predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes.
40.1: Variable-Flag-Dummy
A flag variable (or dummy variable, or indicator variable) is a categorical variable taking only two values, 0 and 1. Ex: Gender (Male, Female) can be recoded into dummy Gender (Male = 0, Female = 1).
42.1: Multicollinearity
Multicollinearity is a condition where some of the predictor variables are strongly correlated with each other.
42.2: Principle-of-Parsimony
Principle of parsimony is the problem-solving principle that “entities should not be multiplied beyond necessity.”
42.3: Overfitting
Overfitting is the production of an analysis that corresponds too closely to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
42.4: Underfitting
Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data.
42.5: PCA
Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables \({m}\), using a smaller set of linear combinations of these variables, called components \({k}\). PCA acts solely on the predictor variables, and ignores the target variable.
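A sketch with base R's `prcomp()` on the built-in `USArrests` data, standardizing the predictors first so each contributes equally:

```r
# PCA on standardized predictors (USArrests has m = 4 variables)
pc <- prcomp(USArrests, scale. = TRUE)
summary(pc)   # proportion of variance explained by each component
```

With standardized inputs the component variances sum to the number of variables, so the proportion explained by the first few components is easy to read off.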
42.6: Eigenvalues
Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \(\mathbf{I}\) be the \(m \times m\) identity matrix. Then the scalars \(\{\lambda_1, \lambda_2, \ldots, \lambda_m\}\) are said to be the eigenvalues of \(\mathbf{B}\) if they satisfy \(|\mathbf{B} - \lambda \mathbf{I}| = 0\), where \(|\mathbf{Q}|\) denotes the determinant of \(\mathbf{Q}\).
42.7: Eigenvector
Let \(\mathbf{B}\) be an \(m \times m\) matrix, and let \({\lambda}\) be an eigenvalue of \(\mathbf{B}\). Then a nonzero \(m \times 1\) vector \(\overrightarrow{e}\) is said to be an eigenvector of \(\mathbf{B}\) if \(\mathbf{B} \overrightarrow{e} = \lambda \overrightarrow{e}\).
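Both definitions can be checked numerically with base R's `eigen()`; the \(2 \times 2\) matrix below is an arbitrary illustrative choice with eigenvalues 3 and 1.

```r
# Eigenvalues and eigenvectors of a small symmetric matrix
B <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(B)
e$values                 # 3 and 1, in decreasing order
# Verify B e = lambda e for the first eigenpair (difference is ~ zero)
B %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]
```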
42.8: Orthogonal
Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other.
42.9: Communality
PCA does not extract all the variance from the variables, but only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. Communality values are calculated as the sum of squared component weights, for a given variable.
43.1: Unsupervised-Methods
In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structures among all the variables. The most common unsupervised data mining method is clustering. Ex: Voter Profile.
43.2: Supervised-Methods
Supervised methods are those in which there is a particular prespecified target variable and the algorithm is given many examples where the value of the target variable is provided. This allows the algorithm to learn which values of the target variable are associated with which values of the predictor variables.
43.3: A-Priori-Hypothesis
An a priori hypothesis is one that is generated prior to a research study taking place.
43.4: Cross-Validation
Cross-validation is a technique for ensuring that the results uncovered in an analysis are generalizable to an independent, unseen data set.
Bias–variance-Trade-off
Even though the high-complexity model has low bias (error rate), it has a high variance; and even though the low-complexity model has a high bias, it has a low variance. This is known as the bias–variance trade-off. The bias–variance trade-off is another way of describing the overfitting-underfitting dilemma.
43.5: MSE
Mean squared error (MSE) is a good measure for evaluating the accuracy of model estimation for a continuous target variable. The MSE is a function of the estimation error (sum of squared errors, SSE) and the model complexity (e.g., degrees of freedom). Between two competing models, the one with lower MSE is preferred.
43.6: Resampling
Resampling refers to the process of sampling at random and with replacement from a data set. It is discouraged.
44.1: Clustering
Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
44.2: Cluster
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
44.3: Euclidean-Distance
Euclidean distance between records is given by equation, \(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_i{\left(x_i - y_i\right)^2}}\), where \(x = \{x_1, x_2, \ldots, x_m\}\) and \(y = \{y_1, y_2, \ldots, y_m\}\) represent the \({m}\) attribute values of two records.
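The formula can be computed directly or with base R's `dist()`; the two records below are made-up three-attribute examples.

```r
# Euclidean distance between two records, by hand and via dist()
x <- c(1, 2, 3)
y <- c(4, 6, 8)
d_manual <- sqrt(sum((x - y)^2))        # sqrt(50)
d_dist   <- as.numeric(dist(rbind(x, y)))
```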
44.4: Hierarchical-Clustering
In hierarchical clustering, a treelike cluster structure (dendrogram) is created through recursive partitioning (divisive methods) or combining (agglomerative) of existing clusters.
44.5: Agglomerative-Clustering
Agglomerative clustering methods initialize each observation to be a tiny cluster of its own. Then, in succeeding steps, the two closest clusters are aggregated into a new combined cluster. In this way, the number of clusters in the data set is reduced by one at each step. Eventually, all records are combined into a single huge cluster. Most computer programs that apply hierarchical clustering use agglomerative methods.
44.6: Divisive-Clustering
Divisive clustering methods begin with all the records in one big cluster, with the most dissimilar records being split off recursively, into a separate cluster, until each record represents its own cluster.
44.7: Single-Linkage
Single linkage, the nearest-neighbor approach, is based on the minimum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most similar members from each cluster. It tends to form long, slender clusters, which may sometimes lead to heterogeneous records being clustered together.
44.8: Complete-Linkage
Complete linkage, the farthest-neighbor approach, is based on the maximum distance between any record in cluster A and any record in cluster B. Cluster similarity is based on the similarity of the most dissimilar members from each cluster. It tends to form more compact, spherelike clusters.
44.9: Average-Linkage
Average linkage is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records. The criterion is the average distance of all the records in cluster A from all the records in cluster B. The resulting clusters tend to have approximately equal within-cluster variability. In general, average linkage leads to clusters more similar in shape to complete linkage than does single linkage.
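All three linkage criteria are available through the `method` argument of base R's `hclust()`; the sketch below uses the built-in `USArrests` data, standardized before computing distances.

```r
# Agglomerative hierarchical clustering under three linkage criteria
d <- dist(scale(USArrests))                   # Euclidean distances
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
clusters <- cutree(hc_complete, k = 4)        # cut dendrogram into 4 clusters
```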
45.1: Cluster-Separation
Cluster separation represents how distant the clusters are from each other.
45.2: Cluster-Cohesion
Cluster cohesion refers to how tightly related the records within the individual clusters are. SSE accounts only for cluster cohesion.
45.3: Silhouette
The silhouette is a characteristic of each data value. For each data value i,
\(\text{Silhouette}_i = s_i = \frac{b_i - a_i}{\text{max}(b_i, a_i)} \to s_i \in [-1, 1]\), where \(a_i\) is the distance between the data value (Cohesion) and its cluster center, and \(b_i\) is the distance between the data value and the next closest cluster center (Separation).
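A sketch using the `cluster` package (attached in the sessionInfo above). Note that `cluster::silhouette()` computes \(a_i\) and \(b_i\) from average dissimilarities to the records in each cluster rather than distances to cluster centers, a closely related variant of the definition above.

```r
library(cluster)
# Silhouette widths for a k-means clustering of standardized USArrests
set.seed(3)
X   <- scale(USArrests)
km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(X))  # one s_i per record, each in [-1, 1]
mean(sil[, "sil_width"])                # average silhouette width
```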
45.4: pseudo-F
The pseudo-F statistic measures the ratio of (i) the separation between the clusters, as measured by the mean square between the clusters (MSB), to (ii) the spread of the data within the clusters, as measured by the mean square error (MSE). i.e. \(F_{k-1, N-k} = \frac{\text{MSB}}{\text{MSE}} = \frac{\text{SSB}/(k-1)}{\text{SSE}/(N-k)}\)
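The ratio can be computed from a k-means fit, since `kmeans()` returns the between-cluster (SSB) and within-cluster (SSE) sums of squares; the data and choice of \(k = 3\) below are illustrative.

```r
# Pseudo-F from a k-means fit: (SSB/(k-1)) / (SSE/(N-k))
set.seed(4)
X <- scale(USArrests)
N <- nrow(X); k <- 3
km <- kmeans(X, centers = k, nstart = 25)
pseudo_F <- (km$betweenss / (k - 1)) / (km$tot.withinss / (N - k))
```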
46.1: Affinity-Analysis
Affinity analysis is the study of attributes or characteristics that “go together.” It seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules take the form "If antecedent, then consequent", along with a measure of the support and confidence associated with the rule.
46.2: Support
The support (s) for a particular association rule \(A \Rightarrow B\) is the proportion of transactions in the set of transactions D that contain both antecedent A and consequent B. Support is symmetric. \(\text{Support} = P(A \cap B) = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions}}\)
46.3: Confidence
The confidence (c) of the association rule \(A \Rightarrow B\) is a measure of the accuracy of the rule, as determined by the percentage of transactions in the set of transactions D containing antecedent A that also contain consequent B. Confidence is asymmetric. \(\text{Confidence} = P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{\text{Number of transactions containing both A and B}}{\text{Total Number of Transactions containing A}}\)
46.4: Itemset
An itemset is a set of items contained in I, and a k-itemset is an itemset containing k items. For example, {Potato, Tomato} is a 2-itemset, and {Potato, Tomato, Onion} is a 3-itemset, each from the vegetable stand set I.
46.5: Itemset-Frequency
The itemset frequency is simply the number of transactions that contain the particular itemset.
46.6: Frequent-Itemset
A frequent itemset is an itemset that occurs at least a certain minimum number of times, having itemset frequency \(\geq \phi\). We denote the set of frequent k-itemsets as \(F_k\).
46.7: A-Priori-Property
a priori property: If an itemset Z is not frequent, then for any item A, \(Z \cup A\) will not be frequent. In fact, no superset of Z (itemset containing Z) will be frequent.
46.8: Lift
Lift is a measure that can quantify the usefulness of an association rule. Lift is symmetric. \(\text{Lift} = \frac{\text{Rule Confidence}}{\text{Prior proportion of Consequent}}\)
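Support, confidence, and lift for a single rule can be computed by hand; the five toy transactions below (reusing the vegetable-stand items from definition 46.4) are hypothetical, and packages such as `arules` automate this at scale.

```r
# Support, confidence, and lift for the rule {Potato} => {Tomato}
trans <- list(c("Potato", "Tomato"),
              c("Potato", "Onion"),
              c("Potato", "Tomato", "Onion"),
              c("Tomato"),
              c("Potato"))
n_trans <- length(trans)
count <- function(items) sum(sapply(trans, function(t) all(items %in% t)))
support    <- count(c("Potato", "Tomato")) / n_trans   # P(A and B) = 2/5
confidence <- support / (count("Potato") / n_trans)    # P(B | A)   = 0.5
lift       <- confidence / (count("Tomato") / n_trans) # 0.5 / 0.6
```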
1.1: cannot-open-connection
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
1.2: need-finite-xlim
Error in plot.window(...) : need finite ’xlim’ values
1.3: par-old-par
Error in par(old.par) : invalid value specified for graphical parameter "pin"
2.1: plot-finite-xlim
Error in plot.window(...) : need finite ’xlim’ values
2.2: Function-Not-Found
Error in arrange(bb, day) : could not find function "arrange"
3.1: Object-Not-Found-01
Error in match.arg(method) : object ’day’ not found
3.2: Comparison-possible
Error in day == 1 : comparison (1) is possible only for atomic and list types
3.3: UseMethod-No-applicable-method
Error in UseMethod("select") : no applicable method for ’select’ applied to an object of class "function"
3.4: Object-Not-Found-02
Error: Problem with mutate() column ... column object ’arr_delay’ not found
22.1: gg-stat-count-geom-bar
Error: stat_count() can only have an x or y aesthetic.
26.1: ggplot-list
Error in is.finite(x) : default method not implemented for type ’list’
26.2: ggplot-data
Error: Must subset the data pronoun with a string.
30.1: shapiro-limit
Error in shapiro.test(...) : sample size must be between 3 and 5000
30.2: t-test-grouping
Error in t.test.formula() : grouping factor must have exactly 2 levels
40.1: Insufficient-Data
Error: Insufficient data values to produce ... bins.
41.1: stat-count-xy
Error: stat_count() can only have an x or y aesthetic.
41.2: stat-count-y
Error: stat_count() must not be used with a y aesthetic.
42.1: CorMat
Error in if (prod(R2) < 0) : missing value where TRUE/FALSE needed